Learning Objectives
- Model Evaluation
- External Validation Techniques
- Internal Validation Techniques
Model Evaluation
When you talk about validating or evaluating a machine learning model, it’s important to know that the validation techniques employed not only help in measuring performance, but also go a long way in helping you understand your model on a deeper level. This is the reason why a significant amount of time is devoted to the process of result validation and model evaluation while building a machine-learning model.
Result validation is a crucial step, as it ensures that our model gives good results not just on the training data but, more importantly, on the live or test data as well.
How is evaluation here different from supervised learning?
In supervised learning, evaluation is mostly done by measuring performance metrics such as accuracy, precision, recall and AUC on the training set and the holdout sets. These metrics help in deciding whether the model is viable. We may then tune the hyperparameters and repeat the process until we achieve the desired performance.
However, in unsupervised learning the process is not as straightforward, because we do not have the ground truth (the labels). In the absence of labels, it is very difficult to identify how results can be validated.
It is thus overall difficult to evaluate the quality of an unsupervised algorithm due to the absence of an explicit goodness metric as used in supervised learning.
Existing Domain Knowledge
Let’s say we have a problem at hand to cluster different songs in Spotify together on the basis of genres, to create different playlists. After our work is done, how do we know it is good enough?
We can verify the results of our clustering exercise through our existing knowledge of the data (for example, if we know that genre A and genre B of music are similar, then we would expect their clusters to be located near each other).
But what if we don’t have such prior knowledge of our data? What if the data isn’t even labelled (as is the case in many real-life clustering cases)? Even if it is, what if these labels are initially meaningless to us? There are plenty of artists that we’ve never even heard of, and if we’re trying to group thousands of tracks then it’s clearly impractical to manually verify every cluster. In these cases, we need some kind of mathematical measure for how ‘successful’ our clustering has been.
Example
Let's come back to our objective of creating clusters on the basis of genres for Spotify playlists. We have learned a number of clustering algorithms so far, so we can try each of them out to create clusters.
So let’s say we implemented 4 algorithms -
Algo 1: Hierarchical Agglomerative Clustering with ward linkage
Algo 2: Hierarchical Agglomerative Clustering with complete linkage
Algo 3: Hierarchical Agglomerative Clustering with average linkage
Algo 4: K-Means Clustering
Now, we need to know which one is performing the best on our data.
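As a rough sketch of how the four candidates could be fitted with scikit-learn: the song features are not part of this unit, so the data below is a synthetic stand-in from `make_blobs`, and the names `X`, `models` and `cluster_labels` are purely illustrative.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the song features: 200 points in 4 groups ("genres").
# This is an assumption for illustration, not the Spotify data.
X, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.6,
                            random_state=42)

# The four candidate algorithms, each asked for 4 clusters.
models = {
    "HAC (ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
    "HAC (complete)": AgglomerativeClustering(n_clusters=4, linkage="complete"),
    "HAC (average)": AgglomerativeClustering(n_clusters=4, linkage="average"),
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=42),
}

# fit_predict returns one cluster id per sample.
cluster_labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, labels in cluster_labels.items():
    print(name, "->", len(set(labels)), "clusters")
```

Keeping the label arrays in one dictionary makes it easy to score every algorithm with the same metric later on.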
Evaluation Techniques
There are two classes of statistical techniques to validate results for cluster learning. These are:
- External validation
- Internal validation
External Validation
Metrics where original labels are required to evaluate clusters.
This type of validation can be carried out if true cluster labels are available.
In this approach we will have a set of clusters S = {C1, C2, C3,..., Cn } which have been generated as a result of some clustering algorithm. We will have another set of clusters P = {D1, D2, D3,..., Dm} which represent the true cluster labels on the same data. The idea is to measure the statistical similarity between the two sets. A cluster set is considered as good if it is highly similar to the true cluster set.
To measure the similarity between S and P, we label each pair of records in the data as Positive if both records belong to the same cluster in P, and as Negative otherwise. The same exercise is carried out for S. We then compute a confusion matrix between the pair labels of S and P, which can be used to measure the similarity.

- TP: Number of pairs of records which are in the same cluster, for both S and P.
- FP: Number of pairs of records which are in the same cluster in S but not in P.
- FN: Number of pairs of records which are in the same cluster in P but not in S.
- TN: Number of pairs of records which are not in the same cluster in S and also not in the same cluster in P.
From these four counts, we can calculate different metrics to estimate the similarity between S (cluster labels generated by the unsupervised method) and P (true cluster labels). Some example metrics which could be used are Precision, Recall and F1-score.
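The pair-counting idea can be sketched in a few lines of plain Python. The helper name `pair_confusion` and the six-record toy labels are made up for illustration; the brute-force loop over all pairs is only practical for small datasets.

```python
from itertools import combinations

def pair_confusion(pred, true):
    """Count TP, FP, FN, TN over all unordered pairs of records."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true)), 2):
        same_in_s = pred[i] == pred[j]   # pair shares a cluster in S
        same_in_p = true[i] == true[j]   # pair shares a cluster in P
        if same_in_s and same_in_p:
            tp += 1
        elif same_in_s:
            fp += 1
        elif same_in_p:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

true = [0, 0, 0, 1, 1, 1]   # P: true cluster labels (toy example)
pred = [0, 0, 1, 1, 2, 2]   # S: labels from some clustering algorithm

tp, fp, fn, tn = pair_confusion(pred, true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)   # 2 1 4 8
print(precision, recall, f1)
```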
Matrix Representation
We can represent our results in a matrix, showing what percentage of each playlist’s songs have ended up in each cluster.
If the clustering had been perfect, we’d expect each row and each column of the matrix to contain exactly one entry of 100% (it needn’t be in a diagonal, of course, since the cluster name assignment is arbitrary).
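One way such a matrix could be built is with scikit-learn's `contingency_matrix`, normalising each row to percentages. The genre and cluster labels below are a made-up toy example, not the results discussed in this unit.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Toy labels (assumed for illustration): 4 jazz songs and 4 rap songs.
genres = ["jazz", "jazz", "jazz", "jazz", "rap", "rap", "rap", "rap"]
clusters = ["A", "A", "A", "B", "B", "B", "B", "B"]

# Rows follow the sorted true labels (jazz, rap),
# columns follow the sorted cluster names (A, B).
counts = contingency_matrix(genres, clusters)

# Percentage of each playlist's songs that ended up in each cluster.
percent = counts / counts.sum(axis=1, keepdims=True) * 100

print(np.round(percent, 1))
```

Here one jazz song has leaked into cluster B, so the jazz row reads 75% / 25% rather than containing a single entry of 100%.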
Matrix Representation for Algo 1

The default ‘ward’ linkage, which tries to minimise variance within clusters, has done a good job with all four genres, though there is some leakage into cluster B: in the 2nd column, several genres have non-zero entries rather than just one.
Matrix Representation for Algo 2

‘Complete’ linkage has clearly not worked well. It has placed a lot of the dataset into cluster A. Cluster C consists of one single rap song.
Matrix Representation for Algo 3

‘Average’ linkage has similar issues to ‘Complete’ linkage. Many data points have been placed into a single cluster, with two clusters consisting of a single song.
Matrix Representation for Algo 4

As with the HAC algorithm using ‘ward’ linkage, K-Means clustering has done a good job across most of the genres, with some jazz and rap songs being ‘mistaken’ for K-Pop.
Matrix Representation
While these matrices are good for ‘eyeballing’ our results, they’re far from mathematically rigorous. Let’s consider some metrics to actually help us assign a number to our cluster quality.
Adjusted Rand Index
The Adjusted Rand Index attempts to express what proportion of the cluster assignments are ‘correct’. It computes a similarity measure between two clusterings (here, the predicted clusters and the true cluster labels) by considering all pairs of samples and counting the pairs that are placed in the same or in different clusters in both, adjusting for random chance.
This (as well as the other metrics we’ll consider) can be evaluated using Scikit-Learn.

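The Adjusted Rand Index is at most 1, which indicates a perfect match. Values near 0 are what a random labelling would give, and negative values (scikit-learn documents a lower bound of -0.5) indicate agreement that is worse than chance.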
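A minimal sketch of the scikit-learn call, using toy labels that are assumed purely for illustration:

```python
from sklearn.metrics import adjusted_rand_score

true = [0, 0, 0, 1, 1, 1]      # true cluster labels (toy example)
perfect = [1, 1, 1, 0, 0, 0]   # same grouping, different cluster names
partial = [0, 0, 1, 1, 2, 2]   # only partly agrees with the truth

ari_perfect = adjusted_rand_score(true, perfect)
ari_partial = adjusted_rand_score(true, partial)

print(ari_perfect)   # 1.0, because cluster names are arbitrary
print(ari_partial)
```

The first call scores 1.0 even though the label values are swapped, which reflects the fact that the index compares groupings rather than cluster names.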

We see that K-Means and Ward Linkage have a high score. We’d expect this, based on the matrices we previously observed.
Fowlkes Mallows Score
The Fowlkes-Mallows Score is similar to the Adjusted Rand Index, in that it tells you the degree to which cluster assignments are ‘correct’.
In particular, it calculates the geometric mean of the pairwise precision and recall. (The geometric mean is a special type of average: for two numbers, we multiply them together and then take the square root.) It is bounded between 0 and 1, with higher values being better.
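To make the geometric-mean definition concrete, the sketch below computes pairwise precision and recall by hand on toy labels (assumed for illustration) and compares the result with scikit-learn's `fowlkes_mallows_score`.

```python
import math
from itertools import combinations
from sklearn.metrics import fowlkes_mallows_score

true = [0, 0, 0, 1, 1, 1]   # true cluster labels (toy example)
pred = [0, 0, 1, 1, 2, 2]   # predicted cluster labels

# Count pairs that share a cluster in the prediction and/or the truth.
tp = fp = fn = 0
for i, j in combinations(range(len(true)), 2):
    same_pred = pred[i] == pred[j]
    same_true = true[i] == true[j]
    if same_pred and same_true:
        tp += 1
    elif same_pred:
        fp += 1
    elif same_true:
        fn += 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

manual = math.sqrt(precision * recall)       # geometric mean
score = fowlkes_mallows_score(true, pred)    # library value

print(manual, score)
```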


We see similar rankings to the Adjusted Rand Index, which we would expect, given that they’re two methods of trying to answer the same question.
More external validation techniques
A few more external validation techniques include:
- Jaccard Similarity
- Mutual Information
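As a hedged sketch of these two ideas: the pair-level Jaccard similarity is computed here from scikit-learn's `pair_confusion_matrix` (available in recent versions), and mutual information through its normalised and chance-adjusted variants. The toy labels are assumed for illustration.

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)
from sklearn.metrics.cluster import pair_confusion_matrix

true = [0, 0, 0, 1, 1, 1]   # true cluster labels (toy example)
pred = [0, 0, 1, 1, 2, 2]   # predicted cluster labels

# 2x2 matrix of pair counts; C[1, 1] holds pairs grouped together in both
# clusterings, C[0, 1] and C[1, 0] hold pairs grouped together in only one.
C = pair_confusion_matrix(true, pred)

# Jaccard: pairs together in both / pairs together in at least one.
jaccard = C[1, 1] / (C[1, 1] + C[0, 1] + C[1, 0])

nmi = normalized_mutual_info_score(true, pred)   # 1.0 means identical grouping
ami = adjusted_mutual_info_score(true, pred)     # additionally corrected for chance

print(jaccard, nmi, ami)
```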
Drawbacks of External Validation
External validation, as the name suggests, requires inputs that are external to the data, typically from business users or subject matter experts.
The idea is to generate clusters on the basis of the knowledge of subject matter experts and then evaluate similarity between the two sets of clusters i.e. the clusters generated by ML and clusters generated as a result of human inputs.
However, in most cases such knowledge is not readily available. This approach is also not very scalable. Hence, in practice, external validation is usually skipped.
Internal validation
Metrics where original labels are not required to evaluate clusters.
Why Internal Validation?
Given that dealing with unlabelled data is one of the main use cases of unsupervised learning, we require some other metrics that evaluate clustering results without needing to refer to ‘true’ labels.
How Internal Validation?
Most of the literature related to internal validation for cluster learning revolves around the following two types of metrics –
- Cohesion within each cluster
- Separation between different clusters
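One common way to put numbers on these two ideas (an assumption here, since the unit does not fix a formula) is the within-cluster sum of squares for cohesion and the between-cluster sum of squares for separation. The helper name and the four-point dataset are illustrative.

```python
import numpy as np

def cohesion_and_separation(X, labels):
    """Return (within-cluster SS, between-cluster SS) for a clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_centre = X.mean(axis=0)
    wss = bss = 0.0
    for k in np.unique(labels):
        points = X[labels == k]
        centre = points.mean(axis=0)
        # Cohesion: squared distances of points to their own cluster centre.
        wss += ((points - centre) ** 2).sum()
        # Separation: squared distance of the cluster centre to the overall
        # centre, weighted by cluster size.
        bss += len(points) * ((centre - overall_centre) ** 2).sum()
    return wss, bss

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # groups nearby points together
bad = [0, 1, 0, 1]    # groups distant points together

print(cohesion_and_separation(X, good))   # low WSS, high BSS
print(cohesion_and_separation(X, bad))    # high WSS, low BSS
```

A lower first number means tighter clusters, and a higher second number means clusters that lie further apart.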

Intuition
Suppose we have the following results from 3 separate clustering analyses.



Evidently, the ‘tighter’ we can make our clusters, the better. Is there some way to give a number to this idea of ‘tightness’?
Internal Validation Metrics
In practice, instead of dealing with two metrics, several measures are available which combine cohesion and separation into a single measure. A few examples of such measures are:
- Silhouette coefficient
- Calinski-Harabasz coefficient
- Dunn index
- Xie-Beni score
- Hartigan index
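As an example of how one of these measures combines the two ideas, here is a small NumPy/SciPy sketch of the Dunn index. Several variants exist; this one assumes the smallest distance between points of different clusters, divided by the largest distance between points of the same cluster. The function name and toy data are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = [X[labels == k] for k in np.unique(labels)]
    # Cohesion part: the widest cluster (largest pairwise distance inside it).
    max_diameter = max(cdist(g, g).max() for g in groups)
    # Separation part: the closest pair of points from two different clusters.
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(groups)
        for b in groups[i + 1:]
    )
    return min_separation / max_diameter

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # tight, well-separated clusters
bad = [0, 1, 0, 1]    # wide, overlapping clusters

print(dunn_index(X, good))   # 5.0
print(dunn_index(X, bad))    # 0.2
```

Higher values indicate compact clusters that are far apart.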
Silhouette Score
The Silhouette Score attempts to describe how similar a data point is to other data points in its cluster, relative to data points not in its cluster (this is aggregated over all data points to get the score for an overall clustering). In other words, it thinks about how ‘distinct’ the clusters are in space. Indeed, one could use any measure of ‘distance’ to calculate the score.
It is bounded between -1 and 1. Values close to -1 suggest that points have been assigned to the wrong clusters, values around 0 indicate overlapping clusters, and values close to +1 indicate dense, well-separated clusters.
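A minimal sketch with scikit-learn's `silhouette_score`, on a four-point toy dataset assumed for illustration. The `metric` argument shows how a different distance measure can be swapped in.

```python
from sklearn.metrics import silhouette_score

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # groups nearby points together
bad = [0, 1, 0, 1]    # groups distant points together

sil_good = silhouette_score(X, good)                      # Euclidean by default
sil_bad = silhouette_score(X, bad)
sil_manhattan = silhouette_score(X, good, metric="manhattan")

print(sil_good)        # close to +1
print(sil_bad)         # negative
print(sil_manhattan)
```

Note that only the data and the predicted labels are passed in; no true labels are needed.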


We see that none of the clusterings has a particularly high Silhouette Score. Interestingly, we see that the Average Linkage clusters have the highest score. Remember, however, that this algorithm produced two clusters that each contained just a single data point, which is unlikely to be a desirable outcome in a real-world situation (a lesson that you often can’t rely on a single metric to make decisions about the quality of an algorithm!).
Calinski-Harabasz Index
The Calinski-Harabasz Index is the ratio of the between-cluster dispersion (how far the cluster centres lie from the overall centre of the data) to the within-cluster dispersion (how far points lie from their own cluster centre).
Since we want the first quantity to be high and the second to be low, a high CH index is desirable. Unlike the other metrics we have seen, this score has no upper bound.
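A short sketch using scikit-learn, where the function is spelled `calinski_harabasz_score`. The four-point toy dataset is assumed for illustration.

```python
from sklearn.metrics import calinski_harabasz_score

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # tight clusters, far apart
bad = [0, 1, 0, 1]    # wide clusters, close together

ch_good = calinski_harabasz_score(X, good)
ch_bad = calinski_harabasz_score(X, bad)

print(ch_good)   # large value: high between-cluster, low within-cluster dispersion
print(ch_bad)    # small value: the opposite situation
```

Because the score is unbounded, it is most useful for comparing different clusterings of the same data rather than as an absolute measure.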


Here we see that our K-Means and Ward Linkage algorithms score highly. The Complete and Average linkage algorithms are punished for having one or two large clusters, which have a higher level of within-cluster variance.
Additional Exploration
You might want to explore a technique called ‘Twin-Sample Validation’.
It should be used in combination with internal validation and can be highly useful for time-series data, where we want to ensure that our results remain the same across time. (If you want to learn more about time-series analysis, check out our Course on Introduction to Time Series Analysis.)
Conclusion
So this brings us to the end of this unit.
This unit set out to cover the most important and most commonly used evaluation metrics for clustering models, and to bring some clarity to what each of them means. I hope it has helped you in some way and motivated you to pick the right metric for your use case when evaluating how good a model you have built.