1. In a recent post, C. Palo has asked for a "top N and others" solution. There is a KB article on the subject. But it's based on sets, which won't work for C. Palo because she has a few filters. So the top N will be dynamic while sets based approach only provides static top N, regardless of filters.

    I would present here a tutorial on the topic. It is based on an approach by Tableau Zen Master Richard Leeke while using a different filter. Also would I explain some of the intricacies. This approach works on filtered results if any.

    The example uses the superstore data set where we will show the top N countries with the highest sales, and group the others in a single bar.

    Step 1. Create an integer parameter [Top N]
    Step 2. Drag [Country / Region] to Details shelf and sort it by Sum(Sales) in descending order.
    Note that all the 4 calculated fields below need to be set to compute using  [Country / Region].

    Step 3. Create an Index: Index()
    Turn it into discrete and drag it to Rows. Set it to compute using [Country / Region]. This index is mainly for the ordering of the results in final presentation.

    Step 4. Create a [Top N+1 Rows] filter:
    • Index()<=[Top N] +1
    Drag it to Filters and set it compute using [Country / Region]. Select True at pop up.

    We only need N+1 rows/bars/marks in the chart. There comes the filter. The (N+1)th row will show "Others" with its aggregated value.

    Step 5. Create [Top Countries & Others]. This will alias the country at the (N+1)th position to be "Others". Drag it to Rows and set it to compute using [Country / Region].
    • If Index()=[Top N] +1
    • Then "Others"
    • Else Attr([Country / Region])
    • End
    Step 6. Create [Top Sales & Others]. At the (N+1)th position, the value is altered to be an aggregated result. Drag it to Columns and set it to compute using [Country / Region].
    • If Index()=[Top N] +1
    • Then Window_Sum(Sum(Sales),0,Last())
    • Else Sum(Sales)
    • End
    Note that in the above formula, the If statement sets the index=0 for the window function which has its own index. The settings here instructs the window function to compute from (N+1)th row to the last row.

    The aggregation at the (N+1)th row can be the average of all the others if you wish:
    • Window_Avg(Sum(Sales),0,Last())
    We are done with a basic "Top N & Others" bar chart with grouping all others in one bar. Note that all table calculations have to compute using [Country / Region].
    Dress up

    Add a few extra steps to make the presentation better.

    - Drag [Top Sales & Others] to Label/Text shelf and set it to compute using  [Country / Region].
    - Create [Others] to color Top N and Others differently. Set it to compute using  [Country / Region].
    • Index()=[Top N]+1
    - In Step 5, add a little calculation to show # of countries in the "Others".
    • If Index()=[Top N] +1
    • Then Str(Size()-[Top N])+" Others
    Here is the final result. The workbook can be downloaded here.
    If you want to have the option of expanding the countries in "Others", you can use a method proposed by another Zen Master Joe Mako.

    Grouping Others per Category

    The above approach can be applied to "Top N and Others" per category. Here is an example based on a question in the Tableau forum. It includes both a bar chart and a cross tab. This is an alternative solution to the one proposed by Zen Master Joe Mako.

    An important change is using Rank_Unique(Sum(Sales)) for indexing, instead of Index(). This is based on an approach described in an early article.

    Make sure all the table calculations in this example to be set to compute using Category.
    Top N & Grouping Others Series
    0

    Add a comment

  2. While doing research on "Ignore in Table Calculations", I stumbled upon a post by Tableau's Patrick Van Der Hyde in which he mentioned Size() for histogram. This reminded me of Tableau Zen Master Joshua Mulligan who also used Size() for histogram in a different way. Both solutions inspired me to explore all the options on histogram using Size(). I am a histogram junkie after all.

    There seems always yet another way of doing things in Tableau, possibly a better way. (How versatile Tableau is!) Here I am armed with Size() and give histogram one more shot, following some early efforts.

    I would use the same Superstore example where we count the number of customers per number of orders. There are a few different approaches to histogram via Size(). In the process, we can gain deep insights into Tableau's working mechanism.

    Stack Marks On Approach

    The steps are:

    Step 1.Drag [Customer Name] to Details shelf, a dimension needed to count the # of orders per customer.
    Step 2.Drag Sum([Number of Records]) to Columns shelf and turn it into Discrete.
    Step 3.Create a calc field [Size]: Size() and drag it to the Rows shelf.
    Voila it takes only 3 steps to build a barebone histogram! 

    Further cleanup can be done as described in an earlier article:

    Step 4.Adding labels to the bar tops through reference lines.
    Step 5.Removing the border of the bricks to make a solid color bar.
    Step 6.Right click unneeded axis annotations to hide them.

    Each customer is represented by a brick, or a mark. In the lower-left corner, we see there are 795 marks in this chart, which is the total number of customers. All marks are stacked up. From the menu, we see that Analysis>Stack Marks is ON by default. (If you check OFF, all the marks will overlap each other and the bar height will become 1.)
    Size() is actually a table calculation that operates on multiple rows. The partition scope is per # of records. The default addressing is table-across. We do not need to do any setup though.

    Stack Marks Off Approach

    In the previous approach, the size of each mark is 1. Actually, we can increase the size of each mark to the number of marks in a partition, that is, Size().

    Let's turn off Stack Marks. Then all marks in the same partition will overlap each other. That means we only see one mark per partition. The other marks are hidden behind it. At the lower-left corner, we see that the total of marks is 795.

    The steps are:

    Step 1.Drag [Customer Name] to Details shelf, a dimension needed to count the # of orders per customer.
    Step 2.Drag Sum([Number of Records]) to Columns shelf and turn it into discrete.
    Step 3.Right click Sum([Number of Records]) and uncheck "Ignore in Table Calculations". This turns Sum([Number of Records]) into a true dimension. Not all blue pills are dimension to the marks!
    Step 4.Create a calc field [Size]: Size() and drag it to Rows shelf.
    Step 5.Set [Size] to Compute Using>Customer Name
    Step 6.Set menu Analysis>Stack Marks to Off
    Step 7.Click menu bar button [ABC] to show mark label.
    Step 8.Right click unneeded axis annotations to hide them.

    By setting Size() to Compute Using>Customer Name, each mark's size will equal the number of customers in a partition. We get solid bars by default.

    Basically Size() is equivalent to Window_Sum() or Window_Count() that have been used in other implementations. Size() is still a table calculation but looks the least intimidating.

    The above Stack Marks On/Off approaches will leave us way more marks than necessary: 795 in total. For the chart in question, there are 34 bars in total, which would have required 34 marks. That is, one mark per bar.

    Stack Marks Off approach requires a couple more steps than the On approach. But it is necessary for the next Minimum Marks solution, which some people may prefer.

    Minimum Marks Approach

    For the purist or for the sake of better performance (less resources and rendering, I guess), we can filter out the marks that are redundant. The filtered result requires the minimum number of marks to represent the data.

    This approach requires a couple additional steps to the Stack Marks Off approach:
    - Create a calculated field [First at 0]: First()=0 and drag it to Filters shelf. Click OK at pop up.
    - Right click it to select Compute Using>Customer Name. Select True as filter option.
    In the above steps, we index the bars in each partition, keep the sole mark at First()=0 and remove the others. This is an Index-N-Filter method that is a must-have tool in dealing with table calculations.

    Note that the # of marks is reduced to 34!

    Binning the Aggregates 

    We can apply further binning to Sum([Number of Records]) in all the above approaches the same way as showed in an earlier article.

    Histogram Template

    In the design, both Size() and Sum([Number of Records]) are generic functions not specific to any use case. Only [Customer Name] is specific to the data set. Any histogram can be done just by replacing this dimension by the one that counts by the partition of the number of records.

    Postscript

    Histogram is such an interesting topic where we can learn a great deal of techniques in Tableau. It is an excellent exercise for learning the details of Tableau. I would enumerate them here:

    - Size() function
    - Table calculation addressing: Compute using >
    - Stack marks on and off
    - Ignore in Table Calculations
    - Indexing and Filtering
    - Adding reference lines
    - Removing border by setting it to None in Color shelf.
    - Labeling
    - Label formatting

    I thought I had written enough on histogram with 4 articles:
    1.Histograms on Aggregated Measures without Table Calculations
    2.Binning Aggregates for Histogram and Aggregate Distributions
    3.Dynamic Histogram Over Time
    4.Histogram via Rank Functions

    When I wrote the 3rd article on histogram, I thought that's it. A trilogy on histogram seemed good enough. Here I am, writing the 5th article in the series on histogram. And I would no longer say this is my last one.

    The workbook can be downloaded here.
    3

    View comments

  3. In a response to my recent article Dynamic Histogram Over Time, Tableau Zen Master Jonathan Drummey came up with a few amazing tricks from his sophisticated toolbox to offer a couple of alternative solutions. The tricks seem to be the inventions of Joe Mako, another Zen Master. One of the solutions using Tableau's rank functions requires less steps. It reveals an intimate relationship between histogram and rank functions. The solution is however deeply buried in a 2-dimensional dynamic histogram maze.

    To illustrate this elegant solution, I am going to apply the new techniques in a simpler context here.

    I will use the example from an earlier article, based on a data set published by English Premier League 2014. Regardless of the final league standing of the 20 clubs, the Premier League allocated Facility Fees to each club depending on the number of matches broadcast by UK network. It's an indicator of the popularity of each club.

    Problem Statement
    There are 7 levels of facility fees allocated to 20 clubs in 2013-2114 season. We would like to know: for each fee level, how many clubs are there? 

    This is a very basic histogram question. There are many ways to calculate this. The purpose here is to illustrate the rank based approach, instead of seeking the easiest.

    Summary of the Approach
    A general approach of Histogram via Rank functions can be summarized as

    Step 1. Labeling the bins: [Label]
    Step 2. Counting the labels in each bin: (Set table calculation to compute use>Club)
    • Rank_Modified([Label]) - Rank([Label]) + 1
    Step 3. Indexing the labels in each bin:
    • Rank_Unique([Label]) - Rank([Label]) + 1
    Step 4. Filtering with Index=1 (Set table calculation to compute use>Club)

    With the above 4 steps, we can get a simple histogram as follows:
    Below we will explain the details of the approach.

    1.Labeling Bins
    We will use level of fees as bin. Each bin will create a partition among the clubs. Each partition has a number of clubs.

    In our case, there are 7 bins. Labels are friendly names for bins. They are easy to reference.

    We could have used "IF fees THEN fee level..." to label the levels. Because for different seasons, the fees may change and the number of levels may change, it is not convenient to hard code the numbers. Instead, we will use a calculated field to automatically generate the labels:
    • "Facility Fee Level "+ STR(Rank_Dense([Facility Fees]))
    Note that we take advantage of the fact that Rank_Dense([Facility Fees]) provides index for the fee levels in a descending order.
    2.Counting Labels per Bin

    I am going to describe a genius way of counting the number using Rank() and Rank_Modified(). Instead of counting the clubs per bin, we count the number of labels per bin. The number of labels per bin is equivalent to the number of clubs per bin anyway. The labels are all of the same value in a bin.

    The table below shows the comparison between Rank() and Rank_Modified(). 
    When there are equal values to rank, a spread is created between the top and the bottom within the same partition using the two different rankings. And the following relationship holds:
    • Spread = Partition Size - 1
    where spread = Rank_Modified([Bin Label])-Rank([Bin Label]). 

    Thus we get
    • Partition Size = Rank_Modified([Bin Label])-Rank([Bin Label])+1
    The partition size is equal to the number of labels in the partition. This provides label count per bin. We will use [Count_via_Rank] to denote this calculated field. Make sure to set table calculation to compute use>Club. 

    The relationship between rank and spread are shown below for every fee level and every club.

    We see that there are duplicate rows of the same count value per bin. Only one is needed in the expected histogram. Thus we have to filter all but one row per bin.

    3.Indexing Rows per Bin 

    For filtering the rows within a partition, we need to first index the rows of that partition.

    Rank_Unique([Bin Label]) provides a distinct ranking for every label, even for labels of the same value. And it is like an index to all the labels.

    The spread between Rank_Unique() and Rank() will provide a way to index the rows within each partition. Spread + 1 will make the index start from 1.

    The following formula provides a very convenient way to index the elements within a partition:
    • Rank_Unique([Bin Label]) - Rank([Bin Label]) + 1
    Let us have a calc field [Index_via_Rank] for the above formula.

    The relationship between Rank(), Rank_Unique() and the spread is shown in the picture below.
    4.Filtering 

    Drag [Index_via_Rank] to the filter shelf and select the value 1. Make sure to set table calculation to compute use>Club. We will get the following Histogram.
    Note that we added colors by Rank_Dense() and Measure Name in extra.

    Voila, we get our histogram. The workbook can be found here.

    Another example
    I built a histogram without table calculations earlier. It is for bar chart only and used some graphical tricks.

    Using the via rank approach, I created the same histogram in a few minutes. Here is the workbook for download.


    The via rank approach is using table calculation and rank functions. It provides a simple yet elegant way to count and index partitions. It can be applied to a broader arena of applications with or without bar chart, one dimensional or multidimensional histograms.




    0

    Add a comment

Blog Archive
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.