1.  Geo Quartile
    Today, almost every practical data set carries geographical attributes.

    In this post, I would like to discuss a way to analyse a geography-based data set by quartile partitioning. That is, we partition the data set into 4 equal-size groups, each covering a contiguous area.

    The example data set will be the US Mass Shootings from 2013-2015. Note that in the analysis, we only focus on the 48 continental states of the USA.

    1. The Quartile Approach

    This is inspired by the widely used boxplot technique. Like the boxplot, we will cut a data set into 4 groups of equal size, hence the quartile approach. The boxplot works along a single dimension; here we add a 2-dimensional twist: longitude and latitude. In 2 dimensions, there are multiple ways to cut the data set into contiguous quartiles, as we will explain below.

    1.1 Longitudinal Quartiles
    First, let's cut the data set into 4 equal groups along the longitude. This will create a new field [Quartile Long4]:

    • if rank_percentile(avg(longitude))<=0.25 then 'Q1'
    • elseif rank_percentile(avg(longitude))<=0.5 then 'Q2'
    • elseif rank_percentile(avg(longitude))<=0.75 then 'Q3'
    • else 'Q4'
    • end

    Using [Quartile Long4] to color the map, we see where the quartile cutpoints are for the mass shooting incidents. It is easy to see that three quarters of the incidents took place in the Eastern half of the United States.
    1.2 Latitudinal Quartiles
    By the same token, we can cut the data set into 4 groups of equal size along the latitude; the calculated field is analogous (sketched below). Here is the result. We see that one quartile is much narrower than the other three.
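    The post does not show this formula, but the analogous calculated field (let's call it [Quartile Lat4]) is simply the longitudinal one with latitude swapped in:

    • if rank_percentile(avg(latitude))<=0.25 then 'Q1'
    • elseif rank_percentile(avg(latitude))<=0.5 then 'Q2'
    • elseif rank_percentile(avg(latitude))<=0.75 then 'Q3'
    • else 'Q4'
    • end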
    1.3 2D Quartiles
    This consists of cutting the data set at the median point along one dimension first, then cutting each half at its median along the other dimension. The order of cutting matters: the resulting quartiles will be different. So we have two possibilities, longitude first or latitude first. The calculated field for longitude first is as follows:

    • // first cut: at the longitude median
    • if rank_percentile(avg(longitude))<=0.50 then 
    •   // second cut: at the latitude median, computed only within this longitude half
    •   if  rank_percentile(if rank_percentile(avg(longitude))<=0.50 then avg(latitude) end) <=0.50
    •   then 'Q1'
    •   else 'Q2'
    •   end
    • else
    •   if  rank_percentile(if rank_percentile(avg(longitude))>0.50 then avg(latitude) end) <=0.50
    •   then 'Q3'
    •   else 'Q4'
    •   end
    • end

    Here is the result.
    We can then do the same with latitude first, which results in a different set of 4 quartiles (the calculation is the mirror image, sketched below). Here is the result.
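    As a sketch, the latitude-first field mirrors the one above, with the two coordinates swapped:

    • if rank_percentile(avg(latitude))<=0.50 then
    •   if rank_percentile(if rank_percentile(avg(latitude))<=0.50 then avg(longitude) end) <=0.50
    •   then 'Q1'
    •   else 'Q2'
    •   end
    • else
    •   if rank_percentile(if rank_percentile(avg(latitude))>0.50 then avg(longitude) end) <=0.50
    •   then 'Q3'
    •   else 'Q4'
    •   end
    • end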
    1.4 Mixed Dimension Quartiles
    We can mix all the above approaches to create new combinations of quartiles.

    For example, we start by cutting the longitude axis in half at the median point. In the left half, we cut the same axis again at the first-quartile point. In the right half, we cut along the latitude axis at its median point. The calculated field for this partitioning, [Long2 + Long x Lat], is as follows:

    • if rank_percentile(avg(Longitude))<=0.50 then
    •   if rank_percentile(avg(Longitude)) <=0.25
    •   then 'Q1'
    •   else 'Q2'
    •   end
    • else
    •   if rank_percentile(if rank_percentile(avg(Longitude))>0.50 then avg(Latitude) end) <=0.50
    •   then 'Q3'
    •   else 'Q4'
    •   end
    • end


    We can swap the order above: cut the first half along the latitude axis, then cut the other half along the longitude axis. This creates a new partition mode.
    By the same token, we can create 2 more partition modes by starting with latitude.

    1.5 Area Estimate and Density Index
    By visual inspection, we can roughly gauge the density of data samples in each quartile group. Still, it is interesting to quantify the density. Here we propose a formula.

    We will use 

    • (Max(Longitude)-Min(Longitude))*(Max(Latitude)-Min(Latitude)) 

    within each quartile as an estimate of its area. In coastal regions, this estimate may be less accurate.

    The density equals the number of data samples divided by the area. Since all groups have the same number of samples, the reciprocal of the area can serve as a relative density index between the groups. We can therefore use it to color the quartiles as a heatmap showing the relative density (a sketch follows).
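    As a sketch (the field name [Density Index] is mine, not from the original workbook), the relative density index can be written as one table calculation, with compute-using set so that it is partitioned by the quartile field:

    • // reciprocal of the bounding-box area estimate of the current quartile
    • 1/((window_max(max(Longitude))-window_min(min(Longitude)))
    •    *(window_max(max(Latitude))-window_min(min(Latitude))))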

    1.6 Highest Density
    There are various ways to divide a data set into geographically contiguous quartiles. In total, there are 8 partition modes, as shown above.

    Is one partitioning better than another? It depends on the question we ask. One question I would like to ask is: which method produces the single quartile with the highest density?

    Given the density formula, we can compute the density for each mode of partitioning, then pick the quartile with the smallest area as the one with the highest density.

    1.7 Area of Interest
    We can apply filters to the data set to analyse only the area of interest to us. For example, we can use
    - Longitude or latitude filters to define the range of interest.
    - Map selector for an arbitrarily shaped area of interest.
    - State filter to analyse a single state.

    For example, here is a partitioning of California's data: [Lat2+Lat x Long]
    2. Extension to the Octile Approach
    The idea can be extended to octile analysis, which allows a finer partitioning of the map.

    While we have 8 modes of partitioning by quartiles, there are 128 ways of partitioning by octiles. Below is an octile partition: [Long4 x Lat2]
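    For illustration, a purely longitudinal octile field (a hypothetical [Octile Long8]) just extends the quartile thresholds to steps of 0.125:

    • if rank_percentile(avg(longitude))<=0.125 then 'O1'
    • elseif rank_percentile(avg(longitude))<=0.25 then 'O2'
    • elseif rank_percentile(avg(longitude))<=0.375 then 'O3'
    • elseif rank_percentile(avg(longitude))<=0.5 then 'O4'
    • elseif rank_percentile(avg(longitude))<=0.625 then 'O5'
    • elseif rank_percentile(avg(longitude))<=0.75 then 'O6'
    • elseif rank_percentile(avg(longitude))<=0.875 then 'O7'
    • else 'O8'
    • end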
    3. Binary Tree or Recursive Partitioning
    Here we try to generalize the above quartile/octile approach to create even finer partitions.

    We can always divide every partition in half at its median point, along one dimension or the other. The partitioning path is like a binary tree. Different orders of choosing the dimensions will create different partitions.

    The division can be applied recursively, down to whatever granularity we desire over the entire area of interest.

    Below is an example of hexadecile partitioning: [Long4 x Lat4]
    4. Mixed Quantile Approach

    The partitioning doesn't have to be a power of 2. It can be a tertile, quintile, sextile, etc.

    Along different dimensions, we can even apply different quantiles: tertiles along one and quintiles along the other. The result would be a 15-quantile partition. Of course, there are different orders of applying the divisions.

    Quantiles are defined in Wikipedia as the cutpoints dividing a data set into equal-sized groups; the term is also used for the partitions thus obtained. Some quantiles have special names, such as tertile, quartile, quintile, sextile and octile.

    5. Postscript

    In geography-based analysis, we often use country, state, city, zip code, etc. to partition the data. To some extent, those partitionings have their own merit. However, those partitions come in different shapes and sizes, with different populations. Even normalization by population won't bring me peace of mind; I have never felt the comparison is on an equal footing. C'est la vie.

    I have been looking for a way to transcend those artificial divisions devised by humans, and to be more objective in our analysis. During my research on US mass shootings, I tried adding marginal boxplots along longitude and latitude. Eventually, I found that by directly coloring the map using quartiles, I got a quite interesting partitioning which gives me a sense of certainty in understanding and analyzing the spatial distribution.

  2.  Shootings
    US mass shootings have been a societal problem for quite some years. Gun violence is a problem particular to the US, where we have the best technology and thinkers, yet we don't know what to do with the guns.

    Here is an effort to understand the distribution of mass shooting incidents. A mass shooting is defined as an incident in which 4 or more people are shot, whether killed or wounded. The study aims to gain some insight into the patterns of the problem.

    The data set is from Shooting Tracker, which has now merged with the Gun Violence Archive.

    Temporal Distribution

    We have performed weekly, monthly and quarterly aggregations of the incidents. The insights are quite obvious:

    - Weekend incidents far outnumber those on working days: Saturday roughly 2x and Sunday roughly 3x the weekday level.
    - Summer months may see more mass shooting incidents.
    - Q3 has the highest number of incidents.

    Geo distribution

    We used a quartile-based approach to analyze the distributions.

    Using quartiles along longitude, we see that 75% of the events are in the Eastern part of the US.

    By using quartiles along longitude first, followed by latitude quartiles within each longitude quartile, we get 16 contiguous areas with equal numbers of incidents. In other words, we created 16 quantiles. We then calculate the size of each quantile and use that value (actually LN(size)) to create a reverse heatmap. Now we see right away that the metropolitan areas of New York and Chicago have the highest incident density.
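    In calculated-field form, the heatmap color is a sketch along these lines (my reconstruction, not the exact field from the workbook), computed within each of the 16 quantiles and paired with a reversed color palette so that smaller areas show up as denser:

    • // log of the bounding-box area estimate of the current quantile
    • LN((window_max(max(Longitude))-window_min(min(Longitude)))
    •    *(window_max(max(Latitude))-window_min(min(Latitude))))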

    That's the result of my recent research. Feel free to improve on it and comment.

    Click the above image to view the interactive version and download the workbook.

    There are other Tableau visualizations on the incidents from different angles:

    Dash Davidson
    Andy Kriebel
    Michael Mixon

    I thought that dimension filtering and data blending were logically permutable. That's how I drew the last diagram for the order of operations.

    Zen Master Jonathan Drummey pointed out that the actual queries for dimension filtering take place before data blending. So here is a new version incorporating Jonathan's contribution, showing the precedence of dimension filters over data blending.
    Within each filter category, there can be subcategories of filters, which also have their own order of operations. Here are the latest diagrams for those within dimension and set filters.


  4. We have worked to understand the order of operations in Tableau at a high level.

    There are also filtering operations at a lower level. Here we are going to look into the order of operations within the dimension filter and the set filter.

    A good understanding of the filtering options lets us take advantage of the versatile functionality of dimension and set filters. In particular, the filters within them are not affected by the dimensions in the view. Otherwise, equivalent standalone filters may require more sophisticated setups, such as advanced table calculations.

    1. Dimension Filter
    For any dimension placed on the filter shelf, right-click it and select the 'Filter' option; you will see an interface with 4 tabs: General, Wildcard, Condition, Top.
    These are actually 4 filters that you can set up to filter this dimension. Believe it or not, from left to right, this is the order of operations for these 4 filters. I figured this out by trial and error.
    Official documentation on dimension filters can be found here.

    2. Set Filter
    Creating a set is a way to define a subset of a dimension. Three filters can be defined when creating the set. Two more can be defined after the set is placed on the filter shelf. In total, there can be as many as 5 filters defined within a set filter.

    2.1 Filters in creating a set
    When creating a set, you are presented with this interface:
    Note there are 3 tabs: General | Condition | Top. Each tab is a filter definition.

    Thus there are 3 filters in creating a set. The order of operations is from left to right.

    2.2 Filters after placing the set in filter shelf
    When placing a set in the filter shelf, we can configure two more filters.
    2.3 The order of operations within set filters
    Again, the order of operations is from left to right. In total, we can define as many as 5 filters in cascade within a set filter.
    3. Minor Filter Descriptions (subcategories of filters)
    Inside both dimension and set filters, there are 4 minor types of filters: General, Wildcard, Condition, Top. Each filter will further narrow the scope of the data set. Below are brief descriptions of the filters.
    3.1 General filter
    There are 3 options in a general filter: Select from list | Custom value list | Use all

    It filters by checking dimension members on or off in a list. Here is an article on using a custom value list.
    3.2 Wildcard filter
    This filter is a string search on the names of the dimension members.
    3.3 Condition filter
    This is a measure filter, using one or more measures as conditions.
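    For example, a condition can be entered "By formula" as an aggregate boolean; a purely illustrative formula, using the familiar Superstore fields, would be:

    • // keep only members with large sales and a modest average discount
    • SUM([Sales]) > 10000 AND AVG([Discount]) < 0.2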
    3.4 Top filter
    This is a rank filter, which is a special table calculation filter. Using it, one can avoid setting up the sophisticated advanced table calculations that a standalone rank filter would require.
    4. Conclusion
    By understanding the order of operations of dimension and set filters, we can create filters in a compact way, instead of creating individual calculated fields as filters or setting up advanced table calculations. Here is a workbook comparing various ways of implementing the same filter.

    Note that set filters take precedence over dimension filters. Since a set filter is basically a special dimension filter, we may have as many as 5+4=9 filtering options for any particular dimension.

    The diagrams for the above order of operations are drawn in Tableau and can be downloaded here.


    Histograms can be created in many ways. The de facto histogram is built with bars. With Index(), we can create one that is a bit more colorful. Click the image below to see an interactive version or download the workbook. I will describe how to create it next.
    This one is created as a Gantt chart, overlaid with a boxplot and colored by [Profit]. It shows the customer distribution sliced by the number of orders, based on the Superstore data set.

    The important steps to create the chart are:

    1. Create a calc field [CountD of Order] and move it to Columns. Turn it continuous. (A possible formula is sketched after this list.)
    2. Drag [Customer ID] to the Detail shelf.
    3. Create Index = index() and move it to Rows. Set it to compute using [Customer ID].
    4. Select Gantt chart in Marks.
    5. Drag [Profit] to the Color shelf.
    Now the Gantt marks are colored but not sorted.
    6. Right-click Index and select Edit Table Calculation. In the advanced settings, set the sorting to Sum(Profit).
    7. Turn on the boxplot, either by adding a reference line or by dragging it from the Analytics tab.
    8. Add labels to the chart by holding the CTRL key and dragging the Index pill to the Label shelf. This makes a copy of the pill.
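    The formula for [CountD of Order] is not spelled out in the post; a plausible definition, assuming the standard Superstore [Order ID] field, is an LOD expression counting distinct orders per customer:

    • // hypothetical reconstruction: number of distinct orders placed by each customer
    • [CountD of Order] = {FIXED [Customer ID]: COUNTD([Order ID])}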

    Now we have an alternative histogram that shows the distribution per number of orders, the variance and the profitability. Note that, with every mark out there, one can drill down to the gory details.




    Both the histogram and the box-and-whisker plot are popular tools to describe the distribution of data in different ways. They provide different insights into the distribution. It's quite interesting to overlay one with the other.

    Today, we will show how to put them together in one chart.
    The above is an example using the Superstore data set. The histogram shows the distribution of the number of products per the number of units sold, then sliced by sub-category.

    The histogram is built via Size(), according to an approach described in an earlier article. The difference here is that we are using an LOD expression to calculate the number of units sold per product:
    • [Units Sold] = {FIXED [Product ID]: SUM([Number of Records])}
    The advantage of an LOD expression is that it can be used both as a dimension and as a continuous pill. A continuous axis makes a real histogram. (See Jonathan Drummey's comments.)

    With a continuous axis, we can create a box plot!

    So it's pretty simple. Based on the Histogram via Size() approach, we can create a histogram of the distribution of products (with Product ID as the dimension).
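    As a sketch of the Size() idea (the field names are mine, and the exact setup may differ from the earlier article): with [Units Sold] on Columns and [Product ID] on Detail, two table calculations computed using [Product ID] give each product its slot and each bin its height:

    • // vertical position of the current product within its [Units Sold] bin
    • [Position] = INDEX()
    • // number of products in the bin, i.e. the height of the histogram bar
    • [Count per Bin] = SIZE()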

    The marks are set to stacked. Note that it is possible to minimize the number of marks in the chart, but we can't filter out the nulls; otherwise the box plot statistics won't be correct.

    The interesting insights we get are:
    1. The distribution of the number of products over the number of units sold.
    2. The median and quartiles over the number of units sold.

    The idea came up when I played with a box plot over jitters, as shown in this blog. The jitters are visually appealing: they show the sample density distribution visually, much like a histogram, but not quantified. I found that the dots can be organized into a histogram.

    The jitters are generated using Index(). We can also use index() to create a histogram.

    Voilà, yet another addition to the series on histogram charting.

    PS. The bar chart doesn't have to be a histogram; it can be another measure. Here is the average product price over the box plot.


    I have written a few blog posts recently on the subject of data scaffolding. Let me summarize them.

    Data scaffolding is a technique to artificially create a data structure for the purpose of visualization. It either reshapes the original data or blends multiple data sources in a way that enables better visualization.

    The technique was pioneered by Tableau Zen Master Joe Mako.

    The general methodology is as follows:
    1. Create a table of pure dimensions to act as the primary data source. It sets up the structure necessary for visualization.
    2. Create measures from the secondary data sources by blending.

    There are two major use cases for scaffolding: Data Reshaping and Data Blending

    Data Reshaping: single data source
    The original data structure is altered through scaffolding (for example, unpivoted) for better computation or better visualization.

    Scaffolding Video Lecture by Joe Mako
    Data Reshaping via Scaffolding

    Data Blending: multiple data sources
    In regular blending, there can be data loss from the secondary data sources, because blending behaves much like a left join.

    In many cases, no single data source is more primary or secondary than the others. Thus we need a third party to act as the primary for all the actual data sources. This third party is a data scaffold that is created artificially.

    Blending Data Via Multi-Dimensional Scaffolding
    Lossless Data Blending via Scaffolding
    Blending Dates via Scaffolding

    Examples of Scaffolding around the web:
    Facebook Jeopardy: Create a Single Sheet Waterfall Chart in Tableau
    GOOGLE ANALYTICS IN TABLEAU: BLENDING DATA FROM MULTIPLE ACCOUNTS
    Basic Monte Carlo Simulations in Tableau



  8. So far, we have talked about data blending via scaffolding: 1,2,3. Blending involves 2 or more data sources.

    Data reshaping is about a single data source. By scaffolding, we can alter or transform the data structure in order to create visualizations that are not straightforward with the original data set.

    Again, Zen Master Joe Mako has lectured about scaffolding in an hour-long video focusing on data reshaping, i.e. dealing with a single data source. He included 4 examples. Here we are going to add 2 more.

    Why alter the data structure? Because Tableau requires the data set to be in a certain structure to be rendered as charts and tables. In other words, we need the right dimensions, which can generate the required number of marks.

    Example 1. Data Padding Via Scaffold
    In this use case, some manager positions are missing in some regions. The rule is to make the assistant manager the acting manager. Let's see how this can be done.

    So the original data set is as follows
    We can see that some of the manager positions are missing. Not all regions have managers. The desired result is as follows:
    Let's create a scaffolding which corresponds to the desired structure:
    It has 2 dimensions and no measures. All we need to do is fill in the values in this scaffold.

    1. Create the scaffolding table in Excel. (Here we have a small table with only 2 dimensions. For more dimensions, see this article on creating a multidimensional scaffold.)

    2. Import this table into Tableau as a data source. It will be used as the primary data source.

    3. Blend the data set with this scaffolding table.

    4. Create a calc field [Employee] in the scaffold, which is a reference to the secondary data set.

    5. Create a calc field [Emp++] in the scaffold:
    • // when a Manager position is empty, fill it with the other employee in the partition (Region)
    • IF ISNULL([Employee]) AND ATTR([Title])='Manager'
    • THEN WINDOW_MAX([Employee])
    • ELSE [Employee]
    • END
    6. Drag [Emp++] to the Rows shelf and set the table calculation to advanced mode. Set the partition to Region and move the rest to addressing.
    Thus we get the result as expected. The workbook can be downloaded here.

    Example 2. Creating a Dimension for Measure Names
    It is often desirable to use [Measure Names] as a dimension, but we can't. That's where scaffolding comes to the rescue. Here is a real-life question where Joe Mako helped solve the puzzle via scaffolding.

    The data set is as follows. It is a survey of 3 questions, which respondents answer with yes or no. There are a number of respondents between 4/1/2015 and 5/23/2015.
    The goal is to tally the percentage of yes and no answers per day for each survey question, like this:
    In the original data set, each question is its own column. There are 3 questions. What we need is a single [Survey] dimension that contains the 3 questions, with an [Answer] dimension holding yes or no. So we build a scaffold like:
    Then we need to build measures like Count and Count % per question per day.
    1. Date is created through [Record], which has 2 records, allowing us to create a Start Date and an End Date via 2 parameters. Through Show Missing Values, we get all the dates between the 2 dates. This allows us to view data per day, per week, per month, etc.

    2. Create [Answer (Fill)] and [Survey (Fill)].
    These two new fields will fill every data mark in the table. The Total() function is a way to densify the data where there were no data marks before.
    • [Answer (Fill)] = TOTAL(MAX([Answer]))
    • [Survey (Fill)] = TOTAL(MAX([Survey]))
    3. Create the measures Count and Count %.
    Set them to compute using Date.
    4. Thus we get our chart and table to visualize the survey results per day.

    So you see how scaffolding is used to create the table and chart! The workbook can be downloaded from here.

    Since my last article on this topic, there have been some discussions. The real diagram could be more complicated than what has been drawn here; the diagram here may be a rough approximation of the real one. Let's try to make it evolve towards the ultimate version.

    So here are a few updates:

    - added Custom SQL, which is an integral part of Tableau's functionality. It can be used as the first filter and transformer for the raw data.
    - added context filters for the secondary data source before blending.
    - moved blending to after the context filters.

    I would like to point out that dimension filters come in two genres:
    - Linking dimension filters operate on both the primary and secondary data sources.
    - Non-linking dimension filters operate only on their respective data source.

    Linking dimension filters work after blending.

    Non-linking dimension filters seem to work before blending. But as long as they only filter their respective data source, with or without blending, the order of operations between blending and non-linking dimension filters doesn't matter. We can thus assume all dimension filters operate after blending.

    So here is the new order of operations diagram drawn in Tableau. Click to see the viz.

  10. Right before the 2015 Tableau Conference, Rody Zakovich of the Tableau community forum initiated a series of appreciation pieces to a select group of members, including me:
    Community Appreciation - Alexander Mou

    I am very much moved. It meant a lot to me. I have been in the forum for over a year, and I am greatly impressed by the amazing energy and camaraderie in the community. The people there are great and friendly. They can help you solve any Tableau problem. They kind of act as the first-line technical support on behalf of Tableau Inc. I learnt a great deal by helping others as well.

    As my Tableau skills got better, I started answering questions from other members, and I became more active this year. Tracy Fitzgerald has been writing a weekly digest for the community. From those digests, I can trace my progress:

    Community Digest - Ending 3/16: 25 correct answers and Questions for Breakfast badge
    Community Digest - Ending 4/6: 50 correct answers and Question Slayer badge
    Community Digest - Ending 6/8: 100 correct answers and Answer Wizard badge which seems to be the highest distinction for a forum member.

    I keep working on the problems posted in the forum. In a good week, I can even get 10 correct answers:
    Community Digest - Ending 6/29

    The forum has become a great inspiration for my blog. You might have noticed that I referred to the forum posts quite often.

    One day, while looking for the Crow's Nest (a place for questions with zero answers), I discovered by chance that I was at the top of the leaderboard. The ranking is based on how many people I have helped during the past 12 months. I may disappear from the dynamic board if others do more than I do. This is a screenshot I took today, on 10/25/2015.
    Below is the note I wrote at the forum in response to the community appreciation:

    Rody, Mark and Simon, thanks for the kind words!  I am so touched.

    Actually I probably learnt more from people in this forum than I put into it. It is such a vibrant community to which I am very grateful. There are so many interesting problems to solve. It is kind of addictive.

    Yes, I have answered a few posts in Chinese, Korean, Japanese, French, Portuguese, Spanish, etc., which are less likely to get an answer. As long as people speak Tableau, the lingua franca of this forum, they deserve an answer. Many non-native English speakers may not write in English, but they most likely can read English. I don't speak most of their languages, but armed with Google Translate, I can figure out most of what they intend to say. Patrick Van Der Hyde and Diego Medrano do that all the time. Anyone here can do it too. English not being my native language, I apologize to Her Majesty if I have misused or abused the Queen's English. BTW, Her Majesty must be really proud if she knew that a contingent of the British Empire is colonizing the leaderboard of the newfound territory called Tableau World.

    I would encourage anyone here to start a blog. Many questions are recurring ones or follow some kind of pattern. They are déjà vu. A quick reply can be a pointer to a blog post. In other words, blogs make knowledge scalable. Otherwise, we have to answer questions one at a time, sometimes repetitively. Given that Tableau has become a victim of its own success, more and more questions will inundate the forum.

    Also, organizing information creates value, not only for general knowledge but also for the blogger himself. For me, I write, therefore I learn. If I can't write it, I don't understand it, at least not to the degree I would like.

    There are many people I am very grateful to in the Tableau community, where diverse yet like-minded people gather and help each other learn and create. In particular, I would like to express my appreciation to Andy Kriebel, whose blog was the first one I read; it opened my eyes to the fantastic world of data visualization. I would also like to thank Jonathan Drummey, Joshua Milligan and Mark Jackson, from whose blogs I learnt a great deal. Last but not least, I would like to thank Joe Mako for his generosity in solving whatever problems were too difficult for me and for offering screen-share sessions. As I have found out along the way, many techniques such as scaffolding and data densification can all be traced back to Joe, the source.

    Keep learning and have fun! Hope to meet you at TC15.

