Friday, November 20, 2015

Overlaying Histogram with Box and Whisker Plot

Histograms and box-and-whisker plots are both popular tools for describing the distribution of data, and each gives a different insight into it. It's quite interesting to overlay one on the other.

Today, we will show how to put them together in one chart.
The above is an example using the Superstore data set. The histogram shows the distribution of the number of products by the number of units sold, sliced by subcategory.

The histogram is built via Size() according to an approach described in an earlier article. The difference here is that we are using an LOD expression to calculate the number of units sold per product:
  • [Units Sold] = {FIXED [Product ID]: SUM([Number of Records])}
The advantage of an LOD expression is that it can be used both as a dimension and as a continuous pill. A continuous axis makes a real histogram. (See Jonathan Drummey's comments.)
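For readers who think in code, the FIXED expression behaves roughly like a group-by that gives each product its own total. Here is a minimal Python sketch of that idea. The rows are made-up (product_id, number_of_records) pairs, not real Superstore data, and the function name is only for illustration.

```python
from collections import defaultdict

# Hypothetical order lines: (product_id, number_of_records). Not real Superstore data.
rows = [("P1", 1), ("P2", 1), ("P1", 1), ("P3", 1), ("P1", 1), ("P2", 1)]

def units_sold_per_product(rows):
    """Roughly mimic {FIXED [Product ID]: SUM([Number of Records])}: one total per product."""
    totals = defaultdict(int)
    for product_id, n in rows:
        totals[product_id] += n
    return dict(totals)

units_sold = units_sold_per_product(rows)
print(units_sold)
```

Each product ends up with a single number, which is why the result can serve as the continuous axis of the histogram.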

With a continuous axis, we can create a box plot!

So it's pretty simple. Based on the Histogram via Size() approach, we can create a histogram of the distribution of products (Product ID as dimension).

The marks are chosen to be stacked. Note that it is possible to minimize the number of marks in the chart, but we can't filter out the nulls; otherwise the box plot stats won't be correct.

The interesting insights we get are:
1. The distribution of the number of products over the number of units sold.
2. The median and quartiles over the number of units sold.
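The two insights can be reproduced outside Tableau with a small Python sketch. The units-sold values below are invented for illustration, and Python's default quartile method may not match the one Tableau uses for box plots, so treat the numbers as indicative only.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical units-sold totals, one entry per product (illustrative values only).
units_sold = [1, 1, 2, 2, 2, 3, 3, 4, 7]

# Insight 1: the histogram, i.e. how many products fall on each units-sold value.
histogram = dict(sorted(Counter(units_sold).items()))

# Insight 2: quartiles of the underlying per-product values (not of the bar heights).
q1, q2, q3 = quantiles(units_sold, n=4)

for value, count in histogram.items():
    print(f"{value:>2} units | {'#' * count}")
print("Q1, median, Q3:", q1, q2, q3)
```

The quartiles are taken from the per-product values themselves, so they follow the skew of the data instead of sitting in the middle of the bars.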

The above idea came when I played with a box plot over jitters, as shown in this blog. The jitters are visually appealing. They show the sample density distribution in a visual way, which is much like a histogram, but not quantified. I found that the dots can be organized as a histogram.

The jitters are generated using Index(). We can also use Index() to create a histogram.
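One way to picture dots organized as a histogram: give every dot a running index within its own value, then plot the dot at that height. The Python sketch below is only an analogy under that assumption. It is not the Tableau calculation itself, and the function name is hypothetical.

```python
from collections import defaultdict

def stack_positions(values):
    """Return (value, height) pairs, where height is a running 1-based index per value."""
    counters = defaultdict(int)
    positions = []
    for v in values:
        counters[v] += 1          # next free slot in this value's column
        positions.append((v, counters[v]))
    return positions

# Hypothetical units-sold values, one per product.
dots = stack_positions([2, 1, 2, 2, 3])
print(dots)
```

The tallest dot in each column equals the bar height a regular histogram would show for that value.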

Voila, yet another addition to the series of histogram charting.

PS. The bar chart doesn't have to be a histogram. It can show another measure. Here is the average product price over a box plot.

5 comments:

  1. As a use case, just added boxplots to Shine's viz https://twitter.com/vizshine/status/642707876348317696

  2. I would like to point out that each bar in the histogram has multiple marks stacking up. See this on marks in histogram http://vizdiff.blogspot.com/2015/05/histogram-via-size.html

    By lowering the color transparency, we make the stacked marks exhibit varying hue intensity by the number of data samples. It becomes a heatmap. Vertically, the scale is different for each state. But the hue will give a hint on the difference in the number of samples.

  3. This does not actually represent the distribution of the customers. Notice that the middle of the box plot is always on the center bar, no matter how skewed the distribution is. By default reference lines are computed on the aggregate values, not the underlying values.

    Replies
    1. Thanks for pointing out the errors in the initial examples. Just re-created the example.
