Monday, April 13, 2015

Binning Aggregates for Histogram and Aggregate Distributions

There are two aggregates in the title. One is for creating bins. The other is for calculating distribution over bins of aggregate functions like sum, average, count etc. In case of count distribution, it is also known as histogram.

In Tableau, there is a menu function for creating bins for any numerical dimension or measure. The bins becomes a new dimension and is mostly useful for creating histogram and the like distributions. Each bin thus created is of equal size. Bins are mutually exclusive, which means that each record of data belongs to one bin only.

In practice, we may need to create bins from an aggregate measure and of non-uniform sizes. And then we need to compute the relevant aggregation over every bin. The purpose is to slice and dice measures from a number of bins. This is a recurring question that appears in various forms, such as this one and many more from the tableau forum:
http://community.tableau.com/message/355082
http://community.tableau.com/thread/160209
http://community.tableau.com/thread/160083
http://community.tableau.com/message/346692

We are going to review the steps for solving the problem. There are 3 simple steps.

1.Labeling bins
2.Windowing aggregate computation
3.Filtering results

Let's apply the approach to a use case from the Tableau forum. The problem statement is: Create bins on the result of Pareto analysis in a 80:15:5 proportion. Pareto analysis is a running sum (in percentage) along a sorted dimension.

In the example, one aggregate is used in creating bins. In this example, the aggregate measure [Cumulative Share of Sales] is used. It is computed as follows:

- Sort [Product Name] by [Sales] in descending order
- Calculate the running % sum of Sales

In practice, the aggregate for bins can be anything, not necessarily the same as this one.

Step 1. Labeling the bins

Then we use this aggregate [Cumulative Share of Sales] to create bins [ABC Bins] and label them as follows:

IF  [Cumulative Share of Sales] <= 0.8 THEN "A.Top 80%"
ELSEIF  [Cumulative Share of Sales] <= 0.95 THEN "B.80%-95%"
ELSE "C.Bottom 5%"
END

We get 3 bins and each has a label.

Step 2. Windowing the aggregate computation

For each bin, we calculate the aggregated sales within that window [Windowed_Sales]:

Case [ABC Bins]
When 'A.Top 80%'   then WINDOW_SUM(if [ABC Bins]='A.Top 80%'   then SUM([Sales]) END)
When 'B.80%-95%'   then WINDOW_SUM(if [ABC Bins]='B.80%-95%'   then SUM([Sales]) END)
When 'C.Bottom 5%' then WINDOW_SUM(if [ABC Bins]='C.Bottom 5%' then SUM([Sales]) END)
END

Note that Window_Sum() function is not affected by the bin labels in Case-When clause. To get the result right, we need to condition the sum within Window_Sum(). We are windowing the sum for each bin actually, and adding up rows having the same label.

The windowed result is one per row. However we just need one per bin. Next we are going to fix that.

Step 3. Filtering the result

As we mentioned above, we just need one aggregation result per bin and filter the rest. Thus here is a calculated field [Binned_Sales] for that:

IF First()=0 OR [ABC Bins]!=lookup([ABC Bins],-1)
THEN [Windowed_Sales]
ELSE NULL
END

This calculated field will give us one result per bin and make the rest of rows all nulls. The following picture shows the relationship between [ABC Bins], [Windowed_Sales], [Binned_Sales], [Sales] and [Cumulative Share of Sales].
By dragging [Binned_Sales] to the filter shelf, setting it to compute using [Product Name] and selecting the option Special>Non-null values, we filtered out all the null rows.
Because the rows are sorted according to [Sales] in descending order, the first row in each bin will have the result of [Windowed_Sales] and all the rest of rows are nulls. The first row of each bin is calculated by this logic:
  • First()=0 OR [ABC Bins]!=lookup([ABC Bins],-1) 
First()=0 is the first row in the first bin. The second part of the logic is for the first row in all other bins.

Clean up

Move the aggregate [Cumulative Share of Sales] and [Product Name] to the detail shelf . These are the dimensions we need to keep. The only dimension we need on the Rows shelf is [ABC Bins]. And last, put [Binned_Sales] on the Label Shelf.

Here is the workbook you can download for the above example.

There is another use case where the problem statement is: Show sum of sales for Top 10 Customers, Top 11-20, OTHERS. The same approach can be used to solve the problem. The workbook can be found following the above link.

3 comments:

  1. Hi Alexander thanks for this interesting solution.
    I am trying to recreate your viz from scratch and I'm doing fine until I get to the point when I move products to detail mark: the numbers in the table appear then as scrambled and the only way to fix it is to turn stack marks off from the analysis menu. Any suggestions on this? I feel as I'm missing something basic.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
    2. Sent the answer to you. See if it works for you.

      Delete