Friday, October 28, 2022

Resolving Simpson's Paradox: The Whole vs The Parts

Christine suggested that I look at Simpson's Paradox, following my recent posts on Anscombe's Quartet and the Datasaurus Dozen. They are all about learning to look at statistics in an impartial way.

Simpson's Paradox is about the difference between the statistics of an entire data set and the statistics of the same data set sliced by a dimension. The two can be quite different or even contradictory, so we can't take one for the other.

We are going to show some visualization techniques to compare the whole vs the parts through two examples.

UC Berkeley Admission Gender Bias

The data is from here. Looking at the campus-wide totals, the overall admission rate is 39%. The men's rate is 45% and the women's is 30%, so it seems that there is a campus-wide bias against women. At the department level, however, departments A and B have very high admission rates for women. They are even higher than those for men.

So it's not enough to just draw conclusions from the stats of the entire school. It may not be fair to some of the departments. We need to look further into the departments. 
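The same comparison can be reproduced in a few lines of Python. This is only a sketch: it assumes the six-department Berkeley 1973 table as it is commonly distributed (for example as R's UCBAdmissions data set), which may not be identical to the data linked above, and the names COUNTS, overall_rate and department_rates are made up for this illustration.

```python
# (admitted, rejected) counts per department and gender.
# Assumption: the widely circulated six-department Berkeley 1973 table
# (e.g. R's UCBAdmissions); the data linked in the post may differ.
COUNTS = {
    "A": {"men": (512, 313), "women": (89, 19)},
    "B": {"men": (353, 207), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (138, 279), "women": (131, 244)},
    "E": {"men": (53, 138), "women": (94, 299)},
    "F": {"men": (22, 351), "women": (24, 317)},
}


def rate(admitted, rejected):
    """Admission rate as a fraction of all applicants."""
    return admitted / (admitted + rejected)


def overall_rate(gender):
    """The whole: pool all departments before computing the rate."""
    admitted = sum(dept[gender][0] for dept in COUNTS.values())
    rejected = sum(dept[gender][1] for dept in COUNTS.values())
    return rate(admitted, rejected)


def department_rates(gender):
    """The parts: one rate per department."""
    return {name: rate(*dept[gender]) for name, dept in COUNTS.items()}


print(f"overall  men {overall_rate('men'):.0%}  women {overall_rate('women'):.0%}")
men, women = department_rates("men"), department_rates("women")
for name in COUNTS:
    print(f"dept {name}   men {men[name]:.0%}  women {women[name]:.0%}")
```

With these counts the pooled rates favor men, while departments A and B favor women. The counts also show why: most men applied to A and B, where admission rates are high, whereas most women applied to departments with low admission rates.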

Superstore Sales Trend

A similar story holds for sales trends. We can calculate the growth trend of sales for the whole business, but we also need to look into the sales trend of every product category, or of each state. The overall trend may not represent that of a single category or a single state.

We can see that copier and phone sales are growing faster than the overall business. Some categories are growing at a rate similar to the overall rate. Others are flat and lag behind.

Because we are comparing only the gradients of the sales trends, the dual axes are not synchronized.
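Outside the workbook, one numeric counterpart to comparing gradients on unsynchronized axes is to divide each trend line's slope by its own average level, so that a small category and the whole business can be compared on the same footing. The sketch below assumes that approach; it is not the calculation used in the workbook, the helper names are hypothetical, and the sales figures are invented purely for illustration (they are not Superstore data).

```python
def slope(values):
    """Least-squares slope of values against the period index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def relative_slope(values):
    """Slope scaled by the series' own mean, so different scales compare."""
    return slope(values) / (sum(values) / len(values))


# Invented yearly sales per category, for illustration only.
sales = {
    "Copiers": [10, 15, 22, 30],
    "Phones": [80, 90, 105, 120],
    "Tables": [50, 50, 51, 50],
}

# The whole: sum the categories period by period.
overall = [sum(period) for period in zip(*sales.values())]

print(f"Overall  {relative_slope(overall):+.1%} per period")
for name, series in sales.items():
    print(f"{name:8s} {relative_slope(series):+.1%} per period")
```

In this made-up example the two growing categories come out above the overall figure and the flat one well below it, which mirrors the pattern described above.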

Feel free to download the workbook and explore it. Leave comments if you have something to share.

