In this post, I would like to discuss a way to analyse a geography-based data set by quartile partitioning. That is, we partition the data set into 4 equal-size groups, each covering a contiguous area.
The example data set is US mass shootings from 2013 to 2015. Note that the analysis covers only the 48 contiguous states of the USA.
This is inspired by the widely used boxplot technique. Like a boxplot, we will cut a data set into 4 groups of equal size, hence the quartile approach. A boxplot works along a single dimension. We will work with a 2-dimensional twist here: longitude and latitude. In 2 dimensions, there are multiple ways to cut the data set into contiguous quartiles, as I will explain below.
1.1 Longitudinal Quartiles
First, let's cut the data set into 4 equal groups along the longitude. This will create a new field [Quartile Long4]:
- if rank_percentile(avg(longitude))<=0.25 then 'Q1'
- elseif rank_percentile(avg(longitude))<=0.5 then 'Q2'
- elseif rank_percentile(avg(longitude))<=0.75 then 'Q3'
- else 'Q4'
- end
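For readers who want to try the same cut outside of a calculated field, here is a minimal Python sketch. The function name `longitude_quartiles`, the sample coordinates, and the rank rule (position in sorted order, which may treat ties differently from rank_percentile) are all assumptions made up for illustration.

```python
def longitude_quartiles(points):
    """Label each (longitude, latitude) point 'Q1'..'Q4' by its longitude rank."""
    n = len(points)
    # Indices of the points, ordered from west (lowest longitude) to east.
    order = sorted(range(n), key=lambda i: points[i][0])
    labels = [None] * n
    for pos, i in enumerate(order):
        # pos * 4 // n maps sorted positions onto 4 (nearly) equal-size groups.
        labels[i] = "Q%d" % (pos * 4 // n + 1)
    return labels

# Hypothetical sample coordinates (longitude, latitude), not the real data set.
sample = [(-122.4, 37.8), (-118.2, 34.1), (-104.9, 39.7), (-95.4, 29.8),
          (-87.6, 41.9), (-84.4, 33.7), (-80.2, 25.8), (-74.0, 40.7)]
labels = longitude_quartiles(sample)
```

Replacing `points[i][0]` with `points[i][1]` would give the latitudinal version of the same cut.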
Using [Quartile Long4] to color the map, we see where the quartile cutpoints are for the mass shooting incidents. It is easy to see that three quarters of the incidents took place in the eastern half of the United States.
1.2 Latitudinal Quartiles
By the same token, we can cut the data set into 4 groups of equal size along the latitude. Here is the result. We see that one quartile is much narrower than the other three.
1.3 2D Quartiles
This consists of cutting a data set at the median point in one dimension first, then cutting each half at its own median in the other dimension. The order of cutting matters: the resulting quartiles will be different. So we have two possibilities: longitude first or latitude first. The calculated field for longitude first is as follows:
- if rank_percentile(avg(longitude))<=0.50 then
- if rank_percentile(if rank_percentile(avg(longitude))<=0.50 then avg(latitude) end) <=0.50
- then 'Q1'
- else 'Q2'
- end
- else
- if rank_percentile(if rank_percentile(avg(longitude))>0.50 then avg(latitude) end) <=0.50
- then 'Q3'
- else 'Q4'
- end
- end
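The nested logic above can also be sketched in Python. This is only an illustration under stated assumptions: `split_at_median` and `quartiles_2d` are hypothetical names, the coordinates are made up, and the median cut is taken by sorted position rather than by rank_percentile itself.

```python
def split_at_median(indices, points, axis):
    """Return (lower half, upper half) of indices, ranked along one axis."""
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def quartiles_2d(points, first=0, second=1):
    """Cut at the median along `first`, then cut each half along `second`."""
    low, high = split_at_median(range(len(points)), points, first)
    # Four groups in label order: Q1, Q2 from the lower half; Q3, Q4 from the upper.
    groups = (split_at_median(low, points, second)
              + split_at_median(high, points, second))
    labels = [None] * len(points)
    for q, group in enumerate(groups, start=1):
        for i in group:
            labels[i] = "Q%d" % q
    return labels

# Hypothetical sample coordinates (longitude, latitude), not the real data set.
sample = [(-122.4, 37.8), (-118.2, 34.1), (-104.9, 39.7), (-95.4, 29.8),
          (-87.6, 41.9), (-84.4, 33.7), (-80.2, 25.8), (-74.0, 40.7)]
long_first = quartiles_2d(sample, first=0, second=1)
```

Calling `quartiles_2d(sample, first=1, second=0)` gives the latitude-first variant.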
Here is the result.
We can then do the same for latitude first, which results in a different set of 4 groups. Here is the result.
1.4 Mixed Dimension Quartiles
We can mix all the above approaches to create new combinations of quartiles.
For example, we start by cutting the longitude axis in half at the median point. In the half to the left of the median, we cut the same axis again at the first quartile point. In the half to the right of the median, we cut along the latitude axis at that half's median point. The calculated field for partitioning [Long2 + Long x Lat] is as follows:
- if rank_percentile(avg(Longitude))<=0.50 then
- if rank_percentile(avg(Longitude)) <=0.25
- then 'Q1'
- else 'Q2'
- end
- else
- if rank_percentile(if rank_percentile(avg(Longitude))>0.50 then avg(Latitude) end) <=0.50
- then 'Q3'
- else 'Q4'
- end
- end
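A mixed mode can be described by three axis choices: the axis of the first cut, the axis used inside the lower half, and the axis used inside the upper half. The Python sketch below follows that reading. The names `halve` and `partition`, the constants `LON` and `LAT`, and the sample coordinates are assumptions for illustration, and cuts are made by sorted position.

```python
LON, LAT = 0, 1  # positions within each (longitude, latitude) tuple

def halve(indices, points, axis):
    """Return (lower half, upper half) of indices, ranked along one axis."""
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def partition(points, first, lower, upper):
    """Cut along `first`, then cut the lower half along `lower`
    and the upper half along `upper`. Returns one label per point."""
    low, high = halve(range(len(points)), points, first)
    groups = halve(low, points, lower) + halve(high, points, upper)
    labels = [None] * len(points)
    for q, group in enumerate(groups, start=1):
        for i in group:
            labels[i] = "Q%d" % q
    return labels

# Hypothetical sample coordinates (longitude, latitude), not the real data set.
sample = [(-122.4, 37.8), (-118.2, 34.1), (-104.9, 39.7), (-95.4, 29.8),
          (-87.6, 41.9), (-84.4, 33.7), (-80.2, 25.8), (-74.0, 40.7)]
# [Long2 + Long x Lat]: longitude first, longitude again on the left, latitude on the right.
labels = partition(sample, LON, LON, LAT)
```

Under this reading, `partition(sample, LON, LAT, LON)` is the swapped mode, and starting with `LAT` gives the latitude-first modes.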
We can swap the above order to cut the left half along the latitude axis and the right half along the longitude axis. This creates a new partition mode.
By the same token, we can create 2 more partition modes by starting with latitude.
1.5 Area Estimate and Density Index
By visual inspection, we can roughly gauge the density of data samples in each quartile group. Still, it is interesting to quantify the density. Here we propose a formula.
We will use
- (Max(Longitude)-Min(Longitude))*(Max(Latitude)-Min(Latitude))
within each quartile as an estimate of its area. In coastal regions, this estimate may be less accurate.
The density equals the number of data samples divided by the area. Since all groups have the same number of samples, the reciprocal of the area can be used as the relative density index between the groups. Thus we can use it to color the quartiles as a heatmap to show the relative density.
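As a rough illustration of the formula, here is a small Python sketch. The function names and the toy coordinates are assumptions. The area is a bounding box measured in square degrees, so the index is only meaningful for comparing groups with each other.

```python
def bounding_box_area(points):
    """(max longitude - min longitude) * (max latitude - min latitude)."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (max(lons) - min(lons)) * (max(lats) - min(lats))

def density_index(groups):
    """Relative density per group: the reciprocal of its bounding-box area.

    Assumes every group holds the same number of points, so the sample
    count can be dropped from the density ratio.
    """
    index = {}
    for name, pts in groups.items():
        area = bounding_box_area(pts)
        # A zero-area box (e.g. all points on one line) has no finite density.
        index[name] = float("inf") if area == 0 else 1.0 / area
    return index

# Toy groups with made-up coordinates, three points each.
groups = {
    "Q1": [(0, 0), (2, 1), (1, 2)],     # box 2 x 2, area 4
    "Q2": [(10, 0), (14, 5), (12, 2)],  # box 4 x 5, area 20
}
index = density_index(groups)
```

Here Q1 gets the larger index because the same number of points falls inside a smaller box.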
1.6 Highest Density
There are various ways to divide a data set into geographically contiguous quartiles. In total, there are 8 partition modes as we showed above.
Is one partitioning better than another? It depends on the question we ask. One question I would like to ask is, which method produces one quartile with the highest density?
Given the density formula, the area can be computed for every quartile under each mode of partitioning. Then we can pick the mode whose smallest quartile by area has the highest density.
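One way to read the 8 modes is as every combination of three axis choices: the first cut, the cut inside the lower half, and the cut inside the upper half. The Python sketch below enumerates them on that assumption and keeps the mode whose smallest bounding box is the smallest overall. All names and the sample coordinates are hypothetical.

```python
from itertools import product

LON, LAT = 0, 1  # positions within each (longitude, latitude) tuple

def halve(indices, points, axis):
    """Return (lower half, upper half) of indices, ranked along one axis."""
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def partition(points, first, lower, upper):
    """Return the 4 groups (lists of point indices) for one partition mode."""
    low, high = halve(range(len(points)), points, first)
    return list(halve(low, points, lower) + halve(high, points, upper))

def bbox_area(points, group):
    """Bounding-box area of the points whose indices are in `group`."""
    lons = [points[i][LON] for i in group]
    lats = [points[i][LAT] for i in group]
    return (max(lons) - min(lons)) * (max(lats) - min(lats))

def densest_mode(points):
    """Return (mode, area) for the mode containing the smallest quartile box."""
    best = None
    for mode in product((LON, LAT), repeat=3):  # 2 * 2 * 2 = 8 modes
        smallest = min(bbox_area(points, g) for g in partition(points, *mode))
        if best is None or smallest < best[1]:
            best = (mode, smallest)
    return best

# Hypothetical sample coordinates (longitude, latitude), not the real data set.
sample = [(-122.4, 37.8), (-118.2, 34.1), (-104.9, 39.7), (-95.4, 29.8),
          (-87.6, 41.9), (-84.4, 33.7), (-80.2, 25.8), (-74.0, 40.7)]
mode, area = densest_mode(sample)
```

Since every quartile holds the same number of points, the smallest box is also the densest one.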
1.7 Area of Interest
We can apply filters to the data set to analyse only the area of interest to us. For example, we can use
- Longitude or latitude filters to define the range of interest.
- A map selector for an arbitrarily shaped area of interest.
- State filter to analyse a single state.
For example, here is a partitioning of California's data: [Lat2+Lat x Long]
2. Extension to Octile Approach
We can see that the idea can be extended to octile analysis. This will allow a finer partition of the map.
While we have 8 modes of partitioning by quartiles, there are 128 ways of partitioning by octiles. Below is an octile partition: [Long4 x Lat2]
3. Binary Tree or Recursive Partitioning
Here we try to generalize the above quartile/octile approach to create even finer partitions.
We can always divide every partition in half at its median point, along one dimension or the other. The partitioning path is like a binary tree. Choosing the dimensions in a different order creates different partitions.
The division can be applied recursively, down to the granularity that we desire over the entire area of interest.
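The repeated halving can be sketched in Python as follows. This is a simplified illustration: the function name `recursive_partition` is hypothetical, the points are synthetic, and every group at a given level is cut along the same axis, whereas the binary tree described above could pick a different axis on each branch.

```python
def recursive_partition(points, axes):
    """Halve every group at its median once per entry in `axes`.

    `axes` lists the axis (0 = longitude, 1 = latitude) used at each level,
    so the result contains 2 ** len(axes) groups of point indices.
    """
    groups = [list(range(len(points)))]
    for axis in axes:
        next_groups = []
        for group in groups:
            ordered = sorted(group, key=lambda i: points[i][axis])
            mid = len(ordered) // 2
            next_groups.append(ordered[:mid])
            next_groups.append(ordered[mid:])
        groups = next_groups
    return groups

# 32 synthetic points with distinct coordinates on both axes (made up).
points = [(i * 7 % 32, i * 5 % 32) for i in range(32)]
# Two longitude levels followed by two latitude levels: 16 groups.
groups = recursive_partition(points, [0, 0, 1, 1])
```

With 32 points and four levels, each of the 16 groups ends up with 2 points.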
Below is an example of hexadecile partitioning: [Long4 x Lat4]
4. Mixed Quantile Approach
This is to say that the number of partitions along a dimension doesn't have to be a power of 2. It can be tertile, quintile, sextile, etc.
Along different dimensions, we can even apply different quantiles: tertiles along one and quintiles along the other. The result would be 15 partitions. Of course, there are different orders of applying the division.
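The tertile-by-quintile case can be sketched in Python like this. The helper names `split_equal` and `mixed_quantiles` and the synthetic points are assumptions for illustration. Here the tertile cut is along longitude and the quintile cut along latitude, which is only one of the possible orders.

```python
def split_equal(indices, points, axis, k):
    """Split indices into k (nearly) equal-size groups, ranked along one axis."""
    ordered = sorted(indices, key=lambda i: points[i][axis])
    n = len(ordered)
    return [ordered[j * n // k:(j + 1) * n // k] for j in range(k)]

def mixed_quantiles(points, k_first, k_second, first=0, second=1):
    """Cut into k_first groups along `first`, then each into k_second along `second`."""
    groups = []
    for band in split_equal(range(len(points)), points, first, k_first):
        groups.extend(split_equal(band, points, second, k_second))
    return groups

# 30 synthetic points with distinct coordinates on both axes (made up).
points = [(i * 7 % 30, i * 11 % 30) for i in range(30)]
# Tertiles along longitude, then quintiles along latitude: 3 * 5 = 15 groups.
groups = mixed_quantiles(points, 3, 5)
```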
Quantiles are defined in Wikipedia as the cutpoints dividing a data set into equal sized groups. They are also the name for the partitions thus obtained. Some quantiles have special names as shown in the list below.
5. Postscript
In geography-based analysis, we often use country, state, city or zip code, etc. to partition the data. To some extent, those partitionings have their own merit. However, those partitions come with different shapes and sizes, and different populations. Even normalization by population won't bring me peace of mind. I have never felt the comparison is on an equal footing. C'est la vie.
I have been looking for a way to transcend those artificial divisions politically devised by humans, simply to be more objective in our analysis. During the research on US mass shootings, I tried to add marginal boxplots along longitude and latitude. Eventually, I found that by directly coloring the map using quartiles, I got a quite interesting partitioning which gives me a sense of certainty in understanding and analyzing the spatial distribution.