1. Christine suggested that I have a look at Simpson's Paradox, following my recent posts on Anscombe's Quartet and the Datasaurus Dozen. They are all about learning to look at statistics in an impartial way.

    Simpson's Paradox is about the difference between the stats of an entire data set and the stats of the same data set sliced by a dimension. The two can be quite different or even contradictory, so we can't substitute one for the other.

    We are going to show some visualization techniques to compare the whole vs the parts through two examples.

    UC Berkeley Admission Gender Bias

    The data is from here. The campus-wide admission rate is 39%: 45% for men and 30% for women. So there seems to be a campus-wide bias against women. At the department level, however, departments A and B have very high admission rates for women, even higher than those for men.

    So it's not enough to draw conclusions from the stats of the entire school alone; doing so may be unfair to some of the departments. We need to look further into each department.
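    The whole-versus-parts comparison can also be checked outside Tableau. Here is a minimal Python sketch, assuming the widely circulated six-department 1973 table (the one shipped as `UCBAdmissions` with R); the table linked above may be laid out differently, and the names `counts`, `rate` and `campus_rate` are only illustrative.

```python
# (admitted, rejected) counts per department and gender.
# Assumption: the six-department 1973 table as distributed with R's
# UCBAdmissions dataset; the post's own source may be formatted differently.
counts = {
    "A": {"men": (512, 313), "women": (89, 19)},
    "B": {"men": (353, 207), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (138, 279), "women": (131, 244)},
    "E": {"men": (53, 138), "women": (94, 299)},
    "F": {"men": (22, 351), "women": (24, 317)},
}


def rate(admitted, rejected):
    """Admission rate for one group of applicants."""
    return admitted / (admitted + rejected)


def campus_rate(gender=None):
    """Rate over all departments, for one gender or for everyone."""
    genders = [gender] if gender else ["men", "women"]
    admitted = sum(counts[d][g][0] for d in counts for g in genders)
    rejected = sum(counts[d][g][1] for d in counts for g in genders)
    return rate(admitted, rejected)


# The same data, sliced by department.
dept_rates = {
    dept: {g: rate(*counts[dept][g]) for g in ("men", "women")}
    for dept in counts
}

print(f"campus: all {campus_rate():.0%}, "
      f"men {campus_rate('men'):.0%}, women {campus_rate('women'):.0%}")
for dept, r in dept_rates.items():
    print(f"dept {dept}: men {r['men']:.0%}, women {r['women']:.0%}")
```

    In this table the reversal comes from the mix of applications: far fewer women than men applied to departments A and B, where admission rates are high for everyone.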

    Superstore Sales Trend

    A similar story holds for sales trends. We can calculate the sales growth trend for the whole business, but we also need to look into the trend of every product category or of each state. The overall trend may not represent that of a single category or state.

    We can see that sales of copiers and phones are growing faster than the overall business. Some categories are growing at a rate similar to the overall rate. Others are flat and lag behind.

    Because we are comparing only the gradients of the sales trends, the dual axes are not synchronized.
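    For readers who want numbers instead of eyeballing unsynchronized axes, one way to put trends of very different sizes on a common footing is to divide each trend slope by the mean of its series. This is a rough Python sketch of that idea, not what the workbook does; the sales figures and category names are hypothetical, and `slope` and `relative_slope` are illustrative helpers.

```python
def slope(values):
    """Least-squares slope of values against the period index 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def relative_slope(values):
    """Slope per period as a fraction of the series mean (scale-free)."""
    return slope(values) / (sum(values) / len(values))


# Hypothetical yearly sales, for illustration only (not Superstore figures).
sales = {
    "Category X": [100, 130, 160, 190],
    "Category Y": [400, 410, 420, 430],
    "Category Z": [500, 500, 500, 500],
}

# The whole business is the sum of its categories in each year.
overall = [sum(year) for year in zip(*sales.values())]

print(f"overall: {relative_slope(overall):+.1%} of mean sales per year")
for name, series in sales.items():
    print(f"{name}: {relative_slope(series):+.1%} of mean sales per year")
```

    With these made-up numbers, one category grows faster than the whole, one grows more slowly, and one is flat, which is the same kind of pattern the chart shows.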

    Feel free to download the workbook and explore it. Leave comments if you have something to share.



  2. #TweakThursday: From time to time I tweak someone else's public viz and try to make it better, in my subjective view.

    When should one use horizontal bars versus vertical bars? How should time-based multiples be ordered in a trellis chart?

    Here are my own rules of thumb:
    • Vertical bars are for time-based trends.
    • Horizontal bars are for categorical comparison.
    • Always place the latest cell in a time-based trellis at the top-left corner where the focus is.
    I found Professor Klaus Schulte's recent MakeoverMonday viz submission quite interesting. I tweaked it a bit according to the above rules. I hope it makes the viz a bit more interesting.

    Here is my resulting viz. Feel free to download it and explore.

    Note that the vertical axis shows the bins of the histogram. We can easily see that France has recently been placed in higher bins than the other 3 highlighted countries. Since the most recent data is more interesting than the past, placing the latest cells at the top-left corner makes it easier to check.
    I had a few exchanges with Klaus on Twitter about my reasoning behind the edit.




  3. This post is about 13 data sets, known as the Datasaurus Dozen, that have the same stats but different distributions. Stats can be deceiving, while data visualization can make a big difference.

    Inspired by Anscombe's quartet and Alberto Cairo's Datasaurus, Justin Matejka and George Fitzmaurice crafted another 12 datasets which have the same stats and different distributions. Thus the Datasaurus and the Dozen.

    Here I recreated them in a Tableau dashboard and dynamically calculated all the summary stats using the native table calculation functions in Tableau. I verified that the 13 datasets have the same stats, yet they visualize very differently. Feel free to download it and explore it on desktop.

    I noticed that the R-squared values differ a bit across the 13 datasets, while those of Anscombe's quartet are much closer to each other.
    By the way, this is the first time I have used the native animation function in Tableau to create this gif. Unfortunately, Tableau Public can't run it because it says the chart is too complex. If you wish, download the workbook to your desktop, where the visual effect is much better.


    Happy exploring with Tableau!

  4. Francis Anscombe, a British statistician and a professor at Princeton and Yale, constructed 4 different sets of data that all have the same stats, known as Anscombe's quartet. However, the quartet's data distributions are quite different.

    Stats alone can be deceiving. Through data visualization, we can gain powerful insights into their differences. 

    So I decided to render Anscombe's quartet in Tableau. All calculations are based on Tableau's native functions. Without this exercise, I might never have had the chance to use some of the statistical functions in Tableau. I hope this inspires more people to use them, such as:

    Variance: WINDOW_VAR(SUM(X))
    Correlation: WINDOW_CORR(SUM(X), SUM(Y))

    The stats summary is generated dynamically and displayed via annotation.

    Here is the resulting dashboard, rendered in a single sheet. Feel free to download it.

    All the trend lines are also identical after being rounded to two decimals. The trend lines are generated by Tableau based on data. We can see that the R-Squared and P-value are also the same.
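    The same summary can be reproduced outside Tableau as a cross-check. Below is a minimal Python sketch, assuming the quartet's values as commonly reproduced from Anscombe's 1973 paper; the helper names (`sample_var`, `corr`, `summarize`) are illustrative. It uses the sample variance (dividing by n - 1), which is what `WINDOW_VAR` returns, and an ordinary least-squares line, which may not match every setting of Tableau's trend line model.

```python
from math import sqrt

# Anscombe's quartet, as commonly reproduced from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(v):
    return sum(v) / len(v)


def sample_var(v):
    """Sample variance (n - 1 in the denominator)."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarize(x, y):
    """Summary stats plus the least-squares line y = intercept + slope * x."""
    r = corr(x, y)
    slope = r * sqrt(sample_var(y) / sample_var(x))
    return {
        "mean_x": mean(x), "var_x": sample_var(x),
        "mean_y": mean(y), "var_y": sample_var(y),
        "corr": r, "r2": r * r,
        "slope": slope, "intercept": mean(y) - slope * mean(x),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

    Rounding to two decimals in the final loop mirrors the dashboard, where the four trend lines only coincide after rounding.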

    Here is the quartet's data:
    Anscombe wanted people to know that stats alone are not enough to characterize a data set. Visualization is important for understanding data and gaining more insight into it. He wrote a 5-page paper in 1973 to stress the use of graphs in statistical analysis.
    Hope that this helps us better understand the value of data visualization.


