Following the previous post on Iris Classification via Clustering in Tableau, we are going to analyse and compare the quality of clustering in Tableau using 1, 2, 3 and 4 variables from the Iris data set. We will use both automatic clustering and a pre-specified number of clusters.

In the Iris data set, we know there are 3 clusters, or classes, of flowers. For some combinations of measures/variables, Tableau can figure out automatically that there are 3 clusters, so we don't need to tell it a priori. In other cases, automatic clustering finds more or fewer than 3. For those cases, in order to compare with all the others, we pre-specify the number of clusters to be 3.

Variable and Clustering Settings

- Create a scatter plot using Petal Length, Petal Width and ID (see the previous post)

- Go to the Analytics tab and drag Cluster to the canvas to create clusters

- In the Variable setting, we can add/remove variables. The default variables are the ones in view, like Petal Length and Petal Width. But they don't have to be the ones used for clustering. We can add 1 to 4 variables as there are in total 4 measures in the data set.

- In the Cluster setting, we can either leave it as automatic or specify the number of clusters.

Clustering

There are two options for clustering: automatic, or a pre-specified number of clusters. The resulting clusters can be moved to the dimension pane as groups.

The symbol before the cluster name means geographic dimension. In reality, each cluster is like a group.

Automatic Clustering

We found 9 combinations of variables for which automatic clustering produces the expected number of clusters: 3.

Pre-Specify the Number of Clusters

In the remaining cases, the automatic option does not produce the expected number of clusters. Since we know there are 3 clusters, and for the purpose of comparison, we set the number of clusters to 3. The cases include all the possible combinations of measures from 1 variable to 4 variables.

Variable Sets

Given 4 possible measures, we have 15 possible sets of variables as follows.

We use a shortened notation for the above sets. For example, PLSW = Petal Length / Sepal Width, where P: Petal, S: Sepal, L: Length, W: Width. (*) marks a pre-specified number of clusters.
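The count of 15 follows from taking every non-empty subset of the 4 measures (2^4 - 1 = 15). As a minimal sketch outside Tableau, the sets and their short names can be enumerated in Python. The ordering of the measures and the function name `variable_sets` are assumptions made for illustration; they are not taken from the workbook.

```python
from itertools import combinations

# Assumed ordering of the four Iris measures; codes follow the post's notation.
MEASURES = {
    "PL": "Petal Length",
    "PW": "Petal Width",
    "SL": "Sepal Length",
    "SW": "Sepal Width",
}

def variable_sets(measures=MEASURES):
    """Return every non-empty combination of measure codes, smallest sets first."""
    codes = list(measures)
    return [
        combo
        for size in range(1, len(codes) + 1)
        for combo in combinations(codes, size)
    ]

sets = variable_sets()
# Join the codes of each set into a short label such as "PLSW".
labels = ["".join(combo) for combo in sets]

print(len(labels))
print(labels)
```

The labels produced this way include PLSW and PLPW, the two abbreviations used in this post.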

Parameterized Scatter Plot

We use a single scatter plot to show the distribution of the 150 data marks, with Petal Width and Petal Length as the axes. The clustering shown is controlled by a parameter. Depending on the variable set chosen, the resulting clusters (groups of IDs) are used to color the data marks, some of which match the known species and some of which do not. A filter allows us to view only the matched or only the mismatched marks.

Quality of the Clustering

The fewer the mismatches, the better the variable set performs under the algorithm. Using Petal Length and Petal Width (PLPW), we got the best match for automatic clustering: 6 mismatches out of 150 records. Using Petal Width (PW) alone, we got the best match with a pre-specified number of clusters: also 6 mismatches out of 150 records. Petal Length (PL) alone got 7 mismatches. More variables/measures don't necessarily mean better results. A single variable can produce a pretty good result.
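Counting mismatches needs one extra step, because cluster names such as "Cluster 1" are arbitrary and have to be paired with species first. The sketch below is one possible way to do that count outside Tableau, not the method used in the workbook: it tries every one-to-one pairing of clusters with species and keeps the pairing with the most matches. The function name `count_mismatches` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations

def count_mismatches(species, clusters):
    """Fewest records whose cluster disagrees with their species, taken over
    every one-to-one assignment of cluster names to species names."""
    if len(species) != len(clusters):
        raise ValueError("species and clusters must have the same length")
    species_names = sorted(set(species))
    cluster_names = sorted(set(clusters))
    if len(cluster_names) > len(species_names):
        raise ValueError("more clusters than species; no one-to-one mapping")
    # How many records fall in each (cluster, species) cell.
    pair_counts = Counter(zip(clusters, species))
    best_matches = 0
    for assigned in permutations(species_names, len(cluster_names)):
        matches = sum(
            pair_counts[(c, s)] for c, s in zip(cluster_names, assigned)
        )
        best_matches = max(best_matches, matches)
    return len(species) - best_matches

# Toy labels for illustration only; these are not rows from the Iris data.
toy_species = ["setosa", "setosa", "versicolor",
               "versicolor", "virginica", "virginica"]
toy_clusters = ["Cluster 2", "Cluster 2", "Cluster 1",
                "Cluster 3", "Cluster 3", "Cluster 3"]

print(count_mismatches(toy_species, toy_clusters))
```

With 3 clusters and 3 species there are only 6 pairings to try, so the brute-force search is cheap here.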

Mismatched IDs per Variable Set

The chart below shows all the mismatched IDs for each variable set. For example, the green dots are mismatched Iris-virginica. Note that ID 107 is mismatched in all 15 variable-set based clusterings.
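An ID that is mismatched under every variable set is simply the intersection of the per-set mismatch lists. A minimal sketch, assuming the lists have been exported into a Python dictionary keyed by variable-set name; the function name `mismatched_everywhere` and the IDs below are placeholders, not the workbook's results.

```python
def mismatched_everywhere(mismatches_by_set):
    """Return the IDs that appear in the mismatch list of every variable set."""
    id_sets = [set(ids) for ids in mismatches_by_set.values()]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)

# Placeholder IDs for illustration only; these are not the workbook's results.
example = {
    "PL": [3, 5, 8],
    "PW": [5, 8, 9],
    "PLPW": [1, 5, 8],
}

common = mismatched_everywhere(example)
print(sorted(common))
```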

Conclusion

The best clustering, with the fewest mismatches, is obtained using Petal Width and Petal Length under the automatic option. Using Petal Width alone with a pre-specified number of clusters, we got the same number of mismatches. The quality of the clustering result is not proportional to the number of variables being used.

Feel free to download the companion workbook for more details.

(Refresh the page if you want to view the gif image multiple times. Or go to Tableau Public and click the button at the top-right corner.)

Jake and I collaborated on a dashboard. He told me that he learnt a way to create an in-place help page in Tableau. He first saw it at a conference somewhere and couldn't recall who the speaker was. So I am blogging here about it but the credit goes to somebody else. If anyone knows who the original creator is, leave a comment below.

The key idea is to float a semi-transparent worksheet on top of the dashboard, with a help text box strategically placed on top of each chart. This way, we can explain how to read each chart, which data points are important, etc. The worksheet can be shown or hidden with a show/hide button.

Below I would like to show how this worksheet can be constructed.

1. Sheet with a single data mark.

  • Double-click the empty space in the Marks panel and type two single quotes. Make the resulting null pill a text label. This creates a single null mark.
  • Set the view as "Entire View"

2. Create a show/hide button

  • Go to the target dashboard
  • Drag a floating vertical container to the dashboard, making it cover all the area of interest.
  • Drag the Single Null Mark sheet and drop it into the above container. Hide the sheet title.
  • Create an open/close button for the container and place the button at the top-right corner.

3. Add annotations

  • Set the sheet background opacity to 70% in the layout manager
  • Select area annotations and place them anywhere of interest.
  • Write help text and format it to highlight important messages.
  • The text can serve as functional guide and/or insight guide.

Here is an example. Feel free to download the workbook and explore. Click the "i" button at the top-right corner to view the in-place help. 
