1.  geo quartile
    Today, almost every data set in practice has geographical attributes.

    In this post, I would like to discuss a way to analyse a geography-based data set by quartile partitioning. That is, we partition the data set into 4 equal-size groups over contiguous areas.

    The example data set will be the US Mass Shootings from 2013-2015. Note that in the analysis, we only focus on the 48 contiguous states of the USA.

    1. The Quartile Approach

    This is inspired by the widely used boxplot technique. Like a boxplot, we will cut the data set into 4 groups of equal size, hence the quartile approach. A boxplot works along a single dimension; we will work with a 2-dimensional twist here: longitude and latitude. In 2 dimensions, there are multiple ways to cut the data set into contiguous quartiles, as we will explain below.

    1.1 Longitudinal Quartiles
    First, let's cut the data set into 4 equal groups along the longitude. This will create a new field [Quartile Long4]:

    • if rank_percentile(avg(longitude))<=0.25 then 'Q1'
    • elseif rank_percentile(avg(longitude))<=0.5 then 'Q2'
    • elseif rank_percentile(avg(longitude))<=0.75 then 'Q3'
    • else 'Q4'
    • end
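    Outside Tableau, the same idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the data set is a plain list of (longitude, latitude) pairs; the per-point longitude rank stands in for rank_percentile(avg(longitude)):

    ```python
    def longitudinal_quartiles(points):
        """Label each (lon, lat) point Q1-Q4 by its longitude rank percentile."""
        n = len(points)
        # Indices sorted by longitude; rank/n approximates rank_percentile.
        order = sorted(range(n), key=lambda i: points[i][0])
        labels = [None] * n
        for rank, i in enumerate(order, start=1):
            pct = rank / n
            if pct <= 0.25:
                labels[i] = 'Q1'
            elif pct <= 0.5:
                labels[i] = 'Q2'
            elif pct <= 0.75:
                labels[i] = 'Q3'
            else:
                labels[i] = 'Q4'
        return labels

    # Four illustrative points, west to east.
    pts = [(-120.0, 34.0), (-95.0, 40.0), (-87.0, 41.8), (-74.0, 40.7)]
    print(longitudinal_quartiles(pts))  # → ['Q1', 'Q2', 'Q3', 'Q4']
    ```

    The sample coordinates are made up for illustration; any list of points works.
    
    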

    Using [Quartile Long4] to color the map, we see where the quartile cutpoints fall for the mass shooting incidents. It is easy to see that three quarters of the incidents took place in the Eastern half of the United States.
    1.2 Latitudinal Quartiles
    By the same token, we can cut the data set into 4 groups of equal size along the latitude. Here is the result. We see that one quartile is much narrower than the other three.
    1.3 2D Quartiles
    This consists of cutting the data set at the median point in one dimension first, then at the median in the other dimension within each half. The order of cutting matters: the resulting quartiles will be different. So we have two possibilities, longitude first or latitude first. The calculated field for longitude first is as follows:

    • if rank_percentile(avg(longitude))<=0.50 then 
    •   if  rank_percentile(if rank_percentile(avg(longitude))<=0.50 then avg(latitude) end) <=0.50
    •   then 'Q1'
    •   else 'Q2'
    •   end
    • else
    •   if  rank_percentile(if rank_percentile(avg(longitude))>0.50 then avg(latitude) end) <=0.50
    •   then 'Q3'
    •   else 'Q4'
    •   end
    • end
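    The nested calculation above can be sketched procedurally as well. Here is a minimal Python illustration of the longitude-first 2D quartiles, again assuming a list of (longitude, latitude) pairs:

    ```python
    def quartiles_2d_long_first(points):
        """Split at the longitude median, then at the latitude median
        within each half: Q1/Q2 in the west half, Q3/Q4 in the east."""
        n = len(points)
        by_lon = sorted(range(n), key=lambda i: points[i][0])
        west, east = set(by_lon[:n // 2]), set(by_lon[n // 2:])
        labels = [None] * n
        for half, (lo, hi) in ((west, ('Q1', 'Q2')), (east, ('Q3', 'Q4'))):
            by_lat = sorted(half, key=lambda i: points[i][1])
            m = len(by_lat)
            for rank, i in enumerate(by_lat, start=1):
                # rank/m mirrors rank_percentile within the half.
                labels[i] = lo if rank / m <= 0.5 else hi
        return labels

    pts = [(-120.0, 34.0), (-95.0, 40.0), (-87.0, 41.8), (-74.0, 40.7)]
    print(quartiles_2d_long_first(pts))  # → ['Q1', 'Q2', 'Q4', 'Q3']
    ```

    Swapping the roles of longitude and latitude in this sketch gives the latitude-first variant.
    
    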

    Here is the result.
    We can then do the same with latitude first, which results in a different set of 4 quartiles. Here is the result.
    1.4 Mixed Dimension Quartiles
    We can mix all the above approaches to create new combinations of quartiles.

    For example, we start by cutting the longitude axis in half at the median point. In the half left of the median, we cut the same axis at the first quartile point. In the half right of the median, we cut along the latitude axis at the median point. The calculated field for the partitioning [Long2 + Long x Lat] is as follows:

    • if rank_percentile(avg(Longitude))<=0.50 then
    •   if rank_percentile(avg(Longitude)) <=0.25
    •   then 'Q1'
    •   else 'Q2'
    •   end
    • else
    •   if rank_percentile(if rank_percentile(avg(Longitude))>0.50 then avg(Latitude) end) <=0.50
    •   then 'Q3'
    •   else 'Q4'
    •   end
    • end


    We can swap the above order to cut the first half along the latitude axis, then cut the other half along the longitude axis. This creates a new partition mode.
    By the same token, we can create 2 more partition modes by starting with latitude.

    1.5 Area Estimate and Density Index
    By visual inspection, we can roughly gauge the density of data samples in each quartile group. Still, it is interesting to quantify the density, so here we propose a formula.

    We will use 

    • (Max(Longitude)-Min(Longitude))*(Max(Latitude)-Min(Latitude)) 

    within each quartile as an estimate for its area. At coastal regions, this estimate may be less accurate.

    The density equals the number of data samples divided by the area. Since all groups have the same number of samples, the reciprocal of the area can serve as a relative density index between the groups. Thus we can use it to color the quartiles as a heatmap showing the relative density.
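    The bounding-box area and its reciprocal are straightforward to compute. A minimal Python sketch, again assuming each quartile group is a list of (longitude, latitude) pairs:

    ```python
    def density_index(points):
        """Relative density index of a group of points: the reciprocal
        of its bounding-box area (all groups hold equal sample counts)."""
        lons = [p[0] for p in points]
        lats = [p[1] for p in points]
        # (Max(Longitude)-Min(Longitude)) * (Max(Latitude)-Min(Latitude))
        area = (max(lons) - min(lons)) * (max(lats) - min(lats))
        return 1.0 / area

    # An illustrative 2-degree by 5-degree group: area 10, index 0.1.
    group = [(-80.0, 30.0), (-78.0, 30.0), (-80.0, 35.0), (-78.0, 35.0)]
    print(density_index(group))  # → 0.1
    ```

    As noted above, the bounding box overestimates the true land area for coastal groups, so the index is only a relative measure.
    
    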

    1.6 Highest Density
    There are various ways to divide a data set into geographically-contiguous quartiles. In total, there are 8 partition modes as we showed above.

    Is one partitioning better than another? It depends on the question we ask. One question I would like to ask is: which mode produces the quartile with the highest density?

    Given the density formula, we can compute it for each mode of partitioning, then pick the quartile with the smallest area as the one with the highest density.

    1.7 Area of Interest
    We can apply filters to the data set to analyse only the area of interest to us. For example, we can use
    - Longitude or latitude filters to define the range of interest.
    - A map selector for an arbitrarily shaped area of interest.
    - A state filter to analyse a single state.

    For example, here is a partitioning of California's data: [Lat2+Lat x Long]
    2. Extension to the Octile Approach
    We can see that the idea can be extended to octile analysis. This will allow a finer partition of the map.

    While we have 8 modes of partitioning by quartiles, there are 128 ways of partitioning by octiles. Below is an octile partition: [Long4 x Lat2]
    3. Binary Tree or Recursive Partitioning
    Here we try to generalize the above quartile/octile approach to create even finer partitions.

    We can always divide every partition in half at the median point, along one dimension or the other. The partitioning path is like a binary tree, and the different orders of choosing the dimension create different partitions.

    The division can be applied recursively, up to the granularity we desire, over the entire area of interest.

    Below is an example of hexadecile partitioning: [Long4 x Lat4]
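    The recursive median-split idea can be sketched compactly. Below is a minimal Python illustration that alternates longitude and latitude at each level; alternating is just one of the many possible orders of choosing the dimension:

    ```python
    def recursive_partition(points, depth, axis=0):
        """Recursively split a list of (lon, lat) points in half at the
        median, alternating axes, into 2**depth equal-size groups."""
        if depth == 0:
            return [points]
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2
        nxt = 1 - axis  # alternate lon/lat; other orders give other modes
        return (recursive_partition(pts[:mid], depth - 1, nxt) +
                recursive_partition(pts[mid:], depth - 1, nxt))
    ```

    With depth 2 this reproduces the 2D quartiles, depth 3 gives octiles, and depth 4 gives hexadeciles such as [Long4 x Lat4].
    
    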
    4. Mixed Quantile Approach

    This is to say that the partitioning doesn't have to be a power of 2. It can be tertiles, quintiles, sextiles, etc.

    At different dimensions, we can even apply different quantiles: tertile along one and quintile along another. The result would be 15-quantile partitions. Of course, there are different orders of applying the division.
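    A minimal Python sketch of such a mixed cut, assuming a list of (longitude, latitude) pairs and tertiles along one axis followed by quintiles along the other:

    ```python
    def mixed_quantiles(points, k1=3, k2=5, axis=0):
        """Cut into k1 equal groups along one axis, then k2 equal groups
        along the other within each band: k1*k2 quantile cells (15 here)."""
        pts = sorted(points, key=lambda p: p[axis])
        n = len(pts)
        cells = []
        for a in range(k1):
            # One tertile band along the first axis...
            band = sorted(pts[a * n // k1:(a + 1) * n // k1],
                          key=lambda p: p[1 - axis])
            m = len(band)
            for b in range(k2):
                # ...split into quintiles along the other axis.
                cells.append(band[b * m // k2:(b + 1) * m // k2])
        return cells
    ```

    Swapping k1 and k2, or starting with the other axis, gives the other orders of applying the division.
    
    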

    Quantiles are defined in Wikipedia as the cutpoints dividing a data set into equal-sized groups; the name is also used for the partitions thus obtained. Some quantiles have special names, such as tertiles, quartiles and quintiles.

    5. Postscript

    In geography-based analysis, we often use country, state, city, zip code, etc. to partition the data. To some extent, those partitionings have their own merit. However, those partitions come with different shapes, sizes and populations. Even normalization by population won't bring me peace of mind; I have never felt the comparison is on an equal footing. C'est la vie.

    I have been looking for a way to transcend those artificial divisions politically devised by humans, to be more objective in our analysis. During the research on US mass shootings, I tried to add marginal boxplots along longitude and latitude. Eventually, I found that by directly coloring the map using quartiles, I got quite an interesting partitioning that gives me a sense of certainty in understanding and analyzing the spatial distribution.

  2.  Shootings
    US mass shootings have been a societal problem for quite some years. Gun violence is a problem particular to the US, where we have the best technology and thinkers, yet we don't know what to do with the guns.

    Here is an effort to understand the distribution of mass shooting incidents. A mass shooting is defined as an incident in which 4 or more people are shot, whether killed or wounded. The study will lead us to certain insights into the pattern of the problem.

    The data set is from Shooting Tracker which is now merged with Gun Violence Archive.

    Temporal Distribution

    We have performed weekly, monthly and quarterly aggregations of the incidents. The insights are quite obvious:

    - Weekend incidents far outnumber those on working days: Saturday ~2x more, Sunday ~3x.
    - Summer months may witness more mass shooting incidents.
    - Q3 is the highest in the number of incidents.

    Geo distribution

    We used a quartile-based approach to analyze the distributions.

    By using quartiles along longitude, we see that 75% of events are in the Eastern part of the US.

    By using quartiles along the longitude first, followed by latitude quartiles within each longitude quartile, we get 16 contiguous areas with equal numbers of incidents. In other words, we created 16-quantiles. Then we calculate the size of each quantile and use the size value (actually LN(size)) to create a reverse heatmap. Now we see right away that the metropolitan areas of New York and Chicago have the highest incident density.

    That's the latest research result. Feel free to improve it and comment on it.

    Click the above image to view the interactive version and download the workbook.

    There are other Tableau visualizations on the incidents from different angles:

    Dash Davidson
    Andy Kriebel
    Michael Mixon

    I thought that dimension filtering and data blending were logically permutable. That's how I drew the last diagram for the order of operations.

    Zen Master Jonathan Drummey pointed out that the actual queries for dimension filtering take place before data blending. So here is a new version incorporating Jonathan's contribution, showing the precedence of dimension filters over data blending.
    Within each filter category, there can be subcategories of filters, which also have their own order of operations. Here are the latest diagrams for those within dimension and set filters.


  4. We have worked to understand the order of operations in Tableau at a high level.

    Actually, there are also filtering operations at a lower level. Here we are going to look into the order of operations within the dimension filter and the set filter.

    A good understanding of the filtering options lets us take advantage of the versatile functionality of the dimension and set filters. In particular, the filters within are not affected by the dimensions in view. Otherwise, equivalent individual filters may require more sophisticated setups, such as advanced table calculations.

    1. Dimension Filter
    For any dimension placed on the filter shelf, right-click it and select the 'Filter' option; you will see an interface with 4 tabs: General, Wildcard, Condition, Top.
    These are actually 4 filters that you can set up to filter this dimension. Believe it or not, from left to right, this is the order of operations for these 4 filters. I figured this out by trial and error.
    Official document regarding dimension filters can be found here.

    2. Set Filter
    Creating a set is a way to define a subset of a dimension. Three filters can be defined when creating the set, and two more can be defined after the set is placed on the filter shelf. So there can be as many as 5 filters defined within a set filter.

    2.1 Filters in creating a set
    When creating a set, you are presented with this interface:
    Note there are 3 tabs: General | Condition | Top. Each tab is a filter definition.

    Thus there are 3 filters in creating a set. The order of operations is from left to right.

    2.2 Filters after placing the set in filter shelf
    When placing a set in the filter shelf, we can configure two more filters.
    2.3 The order of operations within set filters
    Again, the order of operations is from left to right. In total, we can define as many as 5 filters in cascade within a set filter.
    3. Minor Filter Descriptions (subcategories of filters)
    Inside both dimension and set filters, there are 4 minor types of filters: General, Wildcard, Condition, Top. Each filter will further narrow the scope of the data set. Below are brief descriptions of the filters.
    3.1 General filter
    There are 3 options in a general filter: Select from list | Custom value list | Use all

    It is a filter by checking on/off a list of dimension members. Here is an article on using custom value list.
    3.2 Wildcard filter
    This filter is string-search based on the name of the dimension members.
    3.3 Condition filter
    This is a measure filter, using one or more measures as conditions.
    3.4 Top filter
    This is a rank filter, which is a special table calculation filter. Using it, one can avoid setting up the sophisticated advanced table calculations that a standalone rank filter would require.
    4. Conclusion
    By understanding the order of operations of the dimension and set filters, we can create filters in a compact way, instead of creating individual calculated fields as filter or setting up advanced table calculations. Here is a workbook comparing various ways of implementing the same filter.

    Note that the set filter has precedence over the dimension filter, and since a set filter is basically a special dimension filter, we may have as many as 5+4=9 filtering options for any particular dimension.

    The diagrams for the above order of operations are drawn in Tableau and can be downloaded here.

