Wednesday, October 7, 2020

Analysis of Tableau's Multi-Variable Clustering Algorithm

Following the previous post on Iris Classification via Clustering in Tableau, we analyse and compare the quality of Tableau's clustering using 1, 2, 3 and 4 variables from the Iris data set. We use both automatic clustering and a pre-specified number of clusters.

The Iris data set contains 3 classes of flowers, so we expect 3 clusters. For some combinations of measures/variables, Tableau's automatic clustering finds 3 clusters on its own, without being told the number a priori. For other combinations, it finds more or fewer than 3. In those cases, we pre-specify the number of clusters as 3 so that every variable set can be compared on the same basis.

Variable and Clustering Settings

- Create a scatter plot using Petal Length, Petal Width and ID (see the previous post).

- Go to the Analytics tab and drag Cluster onto the canvas to create clusters.

- In the Variables setting, we can add or remove variables. The defaults are the measures in the view, such as Petal Length and Petal Width, but the variables used for clustering don't have to be the ones in the view. We can use anywhere from 1 to 4 variables, since the data set has 4 measures in total.

- In the Cluster setting, we can either leave the number of clusters automatic or specify it.

Clustering

There are two options for clustering: automatic, or a pre-specified number of clusters. The resulting clusters can be moved to the dimension pane as groups.

The symbol before the cluster name means geographic dimension. In reality, each cluster is like a group.

Automatic Clustering

We found 9 combinations of variables for which automatic clustering produces the expected number of clusters: 3.

Pre-Specify the Number of Clusters

In the remaining cases, the automatic option does not produce the expected number of clusters. Since we know there are 3 classes, we set the number of clusters to 3 for the purpose of comparison. Taken together, the cases cover every possible combination of measures, from 1 variable to 4 variables.
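To make "pre-specifying 3 clusters" concrete, here is a minimal sketch of a generic k-means loop (Lloyd's algorithm) in plain Python. Tableau's documentation describes its clustering as k-means, but this is not Tableau's implementation: the seeding rule, the function name `kmeans` and the sample values are assumptions made up for illustration.

```python
def kmeans(points, k, max_iter=100):
    """Generic Lloyd's algorithm on a list of equal-length tuples.

    Returns one cluster index (0..k-1) per point.
    """
    # Assumption: seed the centers with k points spread evenly through the
    # sorted data. Tableau's own initialization may differ.
    ordered = sorted(points)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels

        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Made-up one-variable sample (not actual Iris records), clustered with k = 3.
sample = [(0.2,), (0.3,), (0.2,), (1.3,), (1.4,), (1.5,), (2.0,), (2.1,), (2.3,)]
sample_labels = kmeans(sample, k=3)
print(sample_labels)
```

Passing tuples with 2, 3 or 4 values per point would correspond to clustering on more variables.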

Variable Sets

Given 4 possible measures, we have 15 possible sets of variables: 4 single measures, 6 pairs, 4 triples and the full set of 4.

We use a shortened notation for these sets, where P: Petal, S: Sepal, L: Length, W: Width. For example, PLSW = Petal Length / Sepal Width. (*) marks a pre-specified number of clusters.
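The count of 15 can be checked by enumerating every non-empty combination of the 4 measures. The sketch below uses the post's abbreviations; the ordering of the measures and the helper name `variable_sets` are assumptions, so the order of the generated codes may differ from the workbook.

```python
from itertools import combinations

# Abbreviations follow the post's notation:
# P = Petal, S = Sepal, L = Length, W = Width.
MEASURES = ["PL", "PW", "SL", "SW"]


def variable_sets(measures):
    """Return every non-empty combination of measures as a joined code, e.g. 'PLSW'."""
    sets = []
    for size in range(1, len(measures) + 1):
        for combo in combinations(measures, size):
            sets.append("".join(combo))
    return sets


all_sets = variable_sets(MEASURES)
print(len(all_sets))  # 4 + 6 + 4 + 1 = 15
print(all_sets)
```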

Parameterized Scatter Plot

We use a single scatter plot to show the distribution of the 150 data marks, with Petal Width and Petal Length as the axes. The cluster shown is controlled by a parameter: depending on the selected variable set, the corresponding cluster (groups of IDs) is used to color the data marks, some of which are matches and others mismatches. A filter lets us view only the matched or only the mismatched marks.

Quality of the Clustering

The fewer the mismatches, the better the variable set performs under the algorithm. Using Petal Length and Petal Width (PLPW), we got the best match for automatic clustering: 6 mismatches out of 150 records. Using Petal Width (PW) alone, we got the best match with a pre-specified number of clusters: also 6 mismatches out of 150 records. Petal Length (PL) alone gave 7 mismatches. More variables/measures don't mean better results; a single variable can produce a pretty good result.
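Counting mismatches requires deciding which cluster stands for which species, since cluster names carry no meaning by themselves. The workbook's exact matching rule is not stated here, so the sketch below assumes one reasonable choice: try every one-to-one assignment of clusters to species and keep the one with the fewest mismatches. The function name `count_mismatches` and the toy labels are illustrative only, not the Iris results.

```python
from itertools import permutations


def count_mismatches(species, clusters):
    """Return (mismatch count, mismatched positions) under the best
    one-to-one mapping of cluster names to species names."""
    species_names = sorted(set(species))
    cluster_names = sorted(set(clusters))
    if len(cluster_names) > len(species_names):
        raise ValueError("more clusters than species; no one-to-one mapping exists")

    best = None
    # Try each way of assigning a distinct species to every cluster.
    for perm in permutations(species_names, len(cluster_names)):
        mapping = dict(zip(cluster_names, perm))
        wrong = [i for i, (s, c) in enumerate(zip(species, clusters))
                 if mapping[c] != s]
        if best is None or len(wrong) < len(best):
            best = wrong
    return len(best), best


# Toy labels: 9 records, one versicolor record lands in the virginica cluster.
toy_species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
toy_clusters = ["C1", "C1", "C1", "C2", "C2", "C3", "C3", "C3", "C3"]
toy_result = count_mismatches(toy_species, toy_clusters)
print(toy_result)
```

With only 3 clusters there are just 6 mappings to try, so the brute-force search is cheap.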

Mismatched IDs per Variable Set

The chart below shows all the mismatched IDs for each variable set. For example, the green dots are mismatched Iris-virginica records. Note that ID 107 is mismatched in all 15 variable-set-based clusterings.
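An ID that is mismatched under every variable set is simply a member of the intersection of the per-set mismatch lists. A minimal sketch, with a hypothetical helper `always_mismatched` and placeholder set names and IDs rather than the workbook's data:

```python
def always_mismatched(mismatches_by_set):
    """Return the sorted IDs that appear in every variable set's mismatch list."""
    id_sets = [set(ids) for ids in mismatches_by_set.values()]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Placeholder data: keys and IDs are made up for illustration.
toy_mismatches = {
    "SET_A": [3, 7, 9],
    "SET_B": [7, 9],
    "SET_C": [1, 7],
}
common_ids = always_mismatched(toy_mismatches)
print(common_ids)
```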

Conclusion

The best clustering, with the fewest mismatches, is obtained using Petal Width and Petal Length under the automatic option. Using Petal Width alone with a pre-specified number of clusters, we got the same number of mismatches. The quality of the clustering result is not proportional to the number of variables used.

Feel free to download the companion workbook for more details.
