Thursday, May 21, 2015

Histogram via Size()

While doing research on "Ignore in Table Calculations", I stumbled upon a post by Tableau's Patrick Van Der Hyde in which he mentioned Size() for histogram. This reminded me of Tableau Zen Master Joshua Mulligan who also used Size() for histogram in a different way. Both solutions inspired me to explore all the options on histogram using Size(). I am a histogram junkie after all.

There seems always yet another way of doing things in Tableau, possibly a better way. (How versatile Tableau is!) Here I am armed with Size() and give histogram one more shot, following some early efforts.

I would use the same Superstore example where we count the number of customers per number of orders. There are a few different approaches to histogram via Size(). In the process, we can gain deep insights into Tableau's working mechanism.

Stack Marks On Approach

The steps are:

Step 1.Drag [Customer Name] to Details shelf, a dimension needed to count the # of orders per customer.
Step 2.Drag Sum([Number of Records]) to Columns shelf and turn it into Discrete.
Step 3.Create a calc field [Size]: Size() and drag it to the Rows shelf.
Voila it takes only 3 steps to build a barebone histogram! 

Further cleanup can be done as described in an earlier article:

Step 4.Adding labels to the bar tops through reference lines.
Step 5.Removing the border of the bricks to make a solid color bar.
Step 6.Right click unneeded axis annotations to hide them.

Each customer is represented by a brick, or a mark. In the lower-left corner, we see there are 795 marks in this chart, which is the total number of customers. All marks are stacked up. From the menu, we see that Analysis>Stack Marks is ON by default. (If you check OFF, all the marks will overlap each other and the bar height will become 1.)
Size() is actually a table calculation that operates on multiple rows. The partition scope is per # of records. The default addressing is table-across. We do not need to do any setup though.

Stack Marks Off Approach

In the previous approach, the size of each mark is 1. Actually, we can increase the size of each mark to the number of marks in a partition, that is, Size().

Let's turn off Stack Marks. Then all marks in the same partition will overlap each other. That means we only see one mark per partition. The other marks are hidden behind it. At the lower-left corner, we see that the total of marks is 795.

The steps are:

Step 1.Drag [Customer Name] to Details shelf, a dimension needed to count the # of orders per customer.
Step 2.Drag Sum([Number of Records]) to Columns shelf and turn it into discrete.
Step 3.Right click Sum([Number of Records]) and uncheck "Ignore in Table Calculations". This turns Sum([Number of Records]) into a true dimension. Not all blue pills are dimension to the marks!
Step 4.Create a calc field [Size]: Size() and drag it to Rows shelf.
Step 5.Set [Size] to Compute Using>Customer Name
Step 6.Set menu Analysis>Stack Marks to Off
Step 7.Click menu bar button [ABC] to show mark label.
Step 8.Right click unneeded axis annotations to hide them.

By setting Size() to Compute Using>Customer Name, each mark's size will equal the number of customers in a partition. We get solid bars by default.

Basically Size() is equivalent to Window_Sum() or Window_Count() that have been used in other implementations. Size() is still a table calculation but looks the least intimidating.

The above Stack Marks On/Off approaches will leave us way more marks than necessary: 795 in total. For the chart in question, there are 34 bars in total, which would have required 34 marks. That is, one mark per bar.

Stack Marks Off approach requires a couple more steps than the On approach. But it is necessary for the next Minimum Marks solution, which some people may prefer.

Minimum Marks Approach

For the purist or for the sake of better performance (less resources and rendering, I guess), we can filter out the marks that are redundant. The filtered result requires the minimum number of marks to represent the data.

This approach requires a couple additional steps to the Stack Marks Off approach:
- Create a calculated field [First at 0]: First()=0 and drag it to Filters shelf. Click OK at pop up.
- Right click it to select Compute Using>Customer Name. Select True as filter option.
In the above steps, we index the bars in each partition, keep the sole mark at First()=0 and remove the others. This is an Index-N-Filter method that is a must-have tool in dealing with table calculations.

Note that the # of marks is reduced to 34!

Binning the Aggregates 

We can apply further binning to Sum([Number of Records]) in all the above approaches the same way as showed in an earlier article.

Histogram Template

In the design, both Size() and Sum([Number of Records]) are generic functions not specific to any use case. Only [Customer Name] is specific to the data set. Any histogram can be done just by replacing this dimension by the one that counts by the partition of the number of records.

Postscript

Histogram is such an interesting topic where we can learn a great deal of techniques in Tableau. It is an excellent exercise for learning the details of Tableau. I would enumerate them here:

- Size() function
- Table calculation addressing: Compute using >
- Stack marks on and off
- Ignore in Table Calculations
- Indexing and Filtering
- Adding reference lines
- Removing border by setting it to None in Color shelf.
- Labeling
- Label formatting

I thought I had written enough on histogram with 4 articles:
1.Histograms on Aggregated Measures without Table Calculations
2.Binning Aggregates for Histogram and Aggregate Distributions
3.Dynamic Histogram Over Time
4.Histogram via Rank Functions

When I wrote the 3rd article on histogram, I thought that's it. A trilogy on histogram seemed good enough. Here I am, writing the 5th article in the series on histogram. And I would no longer say this is my last one.

The workbook can be downloaded here.

3 comments:

  1. Hi, I just had a long comment on this here: http://community.tableau.com/message/374177#374177. '

    One particular note for people using this technique is that it's not an accurate histogram when the data is sparse (not the missing values between 30 and 35 and 35 and 41 in the last screenshot), the workaround is to use a continuous pill on Columns instead of a discrete pill.

    ReplyDelete
  2. I did notice the issue which deviates a bit from the definition of histogram. For histogram on bin size >1, which is often the case, this is a lesser problem.

    Tool wise, Tableau needs to let user turn on missing values in dimensions based on aggregates.

    ReplyDelete
  3. Jim Wahl pointed out that Sum(Number of Records) is number of items purchased by a customer, not number of orders. My bad. http://community.tableau.com/message/374205#374205

    ReplyDelete