Sunday, May 10, 2015

Histogram via Rank Functions

In a response to my recent article Dynamic Histogram Over Time, Tableau Zen Master Jonathan Drummey came up with a few amazing tricks from his sophisticated toolbox to offer a couple of alternative solutions. The tricks seem to be the inventions of Joe Mako, another Zen Master. One of the solutions using Tableau's rank functions requires less steps. It reveals an intimate relationship between histogram and rank functions. The solution is however deeply buried in a 2-dimensional dynamic histogram maze.

To illustrate this elegant solution, I am going to apply the new techniques in a simpler context here.

I will use the example from an earlier article, based on a data set published by English Premier League 2014. Regardless of the final league standing of the 20 clubs, the Premier League allocated Facility Fees to each club depending on the number of matches broadcast by UK network. It's an indicator of the popularity of each club.

Problem Statement
There are 7 levels of facility fees allocated to 20 clubs in 2013-2114 season. We would like to know: for each fee level, how many clubs are there? 

This is a very basic histogram question. There are many ways to calculate this. The purpose here is to illustrate the rank based approach, instead of seeking the easiest.

Summary of the Approach
A general approach of Histogram via Rank functions can be summarized as

Step 1. Labeling the bins: [Label]
Step 2. Counting the labels in each bin: (Set table calculation to compute use>Club)
  • Rank_Modified([Label]) - Rank([Label]) + 1
Step 3. Indexing the labels in each bin:
  • Rank_Unique([Label]) - Rank([Label]) + 1
Step 4. Filtering with Index=1 (Set table calculation to compute use>Club)

With the above 4 steps, we can get a simple histogram as follows:
Below we will explain the details of the approach.

1.Labeling Bins
We will use level of fees as bin. Each bin will create a partition among the clubs. Each partition has a number of clubs.

In our case, there are 7 bins. Labels are friendly names for bins. They are easy to reference.

We could have used "IF fees THEN fee level..." to label the levels. Because for different seasons, the fees may change and the number of levels may change, it is not convenient to hard code the numbers. Instead, we will use a calculated field to automatically generate the labels:
  • "Facility Fee Level "+ STR(Rank_Dense([Facility Fees]))
Note that we take advantage of the fact that Rank_Dense([Facility Fees]) provides index for the fee levels in a descending order.
2.Counting Labels per Bin

I am going to describe a genius way of counting the number using Rank() and Rank_Modified(). Instead of counting the clubs per bin, we count the number of labels per bin. The number of labels per bin is equivalent to the number of clubs per bin anyway. The labels are all of the same value in a bin.

The table below shows the comparison between Rank() and Rank_Modified(). 
When there are equal values to rank, a spread is created between the top and the bottom within the same partition using the two different rankings. And the following relationship holds:
  • Spread = Partition Size - 1
where spread = Rank_Modified([Bin Label])-Rank([Bin Label]). 

Thus we get
  • Partition Size = Rank_Modified([Bin Label])-Rank([Bin Label])+1
The partition size is equal to the number of labels in the partition. This provides label count per bin. We will use [Count_via_Rank] to denote this calculated field. Make sure to set table calculation to compute use>Club. 

The relationship between rank and spread are shown below for every fee level and every club.

We see that there are duplicate rows of the same count value per bin. Only one is needed in the expected histogram. Thus we have to filter all but one row per bin.

3.Indexing Rows per Bin 

For filtering the rows within a partition, we need to first index the rows of that partition.

Rank_Unique([Bin Label]) provides a distinct ranking for every label, even for labels of the same value. And it is like an index to all the labels.

The spread between Rank_Unique() and Rank() will provide a way to index the rows within each partition. Spread + 1 will make the index start from 1.

The following formula provides a very convenient way to index the elements within a partition:
  • Rank_Unique([Bin Label]) - Rank([Bin Label]) + 1
Let us have a calc field [Index_via_Rank] for the above formula.

The relationship between Rank(), Rank_Unique() and the spread is shown in the picture below.
4.Filtering 

Drag [Index_via_Rank] to the filter shelf and select the value 1. Make sure to set table calculation to compute use>Club. We will get the following Histogram.
Note that we added colors by Rank_Dense() and Measure Name in extra.

Voila, we get our histogram. The workbook can be found here.

Another example
I built a histogram without table calculations earlier. It is for bar chart only and used some graphical tricks.

Using the via rank approach, I created the same histogram in a few minutes. Here is the workbook for download.


The via rank approach is using table calculation and rank functions. It provides a simple yet elegant way to count and index partitions. It can be applied to a broader arena of applications with or without bar chart, one dimensional or multidimensional histograms.




No comments:

Post a Comment