Tuesday, July 21, 2015

Creating Bins via Math Formula

Zen Master Jonathan Drummey left a comment in Creating Bins Made Easy, saying that he is not using "IF ... ELSEIF..." statement when creating bins of equal size. Instead, he just creates a calculated field with some math wizardry. How brilliant! Jonathan's repertoire is infinite. Let's see how it works.

We use the same example as in Creating Bins Made Easy where we need to create bins of equal size on percentiles. The number of bins is 25. And we want to label the bins from 1 to 25. (If your number of bins is N, just replace 25 by N in the following.)

1.Via Ceiling()
The formula for the calc field is
  • Ceiling(Percentile*25) 
Ceiling() is a new function introduced in Tableau 9. However it doesn't work with table calculations. What a pity! Percentile is usually generated using Rank_Percentile() which is a table calculation. This means that the use of Ceiling() is limited.

2.Via IIF()
The good news is, this formula works with table calculation:
  • IIF([Percentile]*25 = INT([Percentile]*25), [Percentile]*25, INT([Percentile]*25)+1)
This is a way to achieve the same as the ceiling function. And it works with table calculations! What it does is, if [Percentile]*25 is an integer, then the bin label (the ceiling) is the integer itself, else it is the floor INT([Percentile]*25) plus one.

Unfortunately, the equation [Percentile]*25 = INT([Percentile]*25) is often not exact due to the low precision calculation in Rank_Percentile(). The generated percentile can be a bit off. For example, a theoretical 20% percentile could be 20.000000001% and thus may go to another bin instead of the expected one.

3.Via IIF() Revised

Theoretically speaking, both Ceiling() and IIF() approaches can work wonders. Due to the precision limitation in Tableau's calculation, the created bins are not as exact as expected. Precision wise, Rank/100 can do better than Rank_Percentile().

To remedy the problem, I revised the equation as

  • INT([Percentile]*25*1000000)/1000000 = INT([Percentile]*25) 

or equivalently

  • INT([Percentile]*25*1000000) = INT([Percentile]*25) *1000000

The trick is to use only the more significant digits of the decimals and remove the noisy bits resulting from the calculation of Rank_Percentile(). When I format the percentile values to have 8+ decimals, the noise starts to show.

So the ceiling formula will be revised as 

  • IIF(INT([Percentile]*25*1000000)/1000000=INT([Percentile]*25), [Percentile]*25, INT([Percentile]*25)+1)

PS. The judicious choice of using 1000000 or 1000 as a factor depends on how tolerant you are to the computing noise. If using 1000000 which will keep the 6 most significant decimals, only something like 20.0000001% will be put in the same bucket as 20%. If using 1000,  20.0001% will be in the same bucket as 20%.

Included is a workbook that you can download and play with.

1 comment:

  1. Aaron Clancy provided a math formula for creating uniform bins on any aggregated measure: 1-10,11-20,21-30,.... http://community.tableau.com/message/303878#303878 (Bin size can be any integer. 10 here is just an example.)

    IF SUM([Number of Records])/10 <1 THEN 0
    ELSE DIV(SUM([Number of Records]),10)*10-10
    END

    SUM([Number of Records]) can be replaced by any aggregated measure.

    ReplyDelete