A cheatsheet to Clustering algorithms

Sairam Penjarla
Published in Analytics Vidhya · 3 min read · Jul 15, 2021


Clustering algorithms are among the most popular tools used by machine learning practitioners across the world for unsupervised grouping problems. The best known of them is K-means clustering, but in this story we shall see how it works, what other options are available, when to use which, how they all differ, and much more.

K-means

  • It requires the number of clusters to be specified before the algorithm is even applied, and it assumes that clusters are convex shaped.
  • Many of the pages documenting K-means, including scikit-learn’s official documentation, use made-up (synthetic) data with very few features, so the number of clusters can easily be identified visually with the help of a scatter plot.
  • In the real world, however, it is very unlikely that you will find data that simple, so knowing the number of clusters in advance is a bit difficult.
  • A few techniques, such as asking the stakeholder, the elbow method, and the silhouette coefficient, help in identifying the number of clusters.
  • It uses the Within-Cluster Sum of Squares (WCSS) as its objective function (loss function, in deep learning terms) and improves it at every iteration.
  • A variation of K-means clustering is Mini Batch K-Means clustering. It works on small random samples (mini-batches) of the input data; other than that, everything else is the same. Its accuracy is slightly lower compared to regular K-means clustering, as the sketch below illustrates.
(Figure: comparison of the K-Means and MiniBatchKMeans clustering algorithms)
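To make the comparison concrete, here is a minimal sketch using scikit-learn’s KMeans and MiniBatchKMeans on synthetic blob data; the dataset, the n_clusters value, and the batch size are illustrative assumptions rather than values from a real project.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Made-up blob data with a known number of clusters (assumption for the demo).
X, _ = make_blobs(n_samples=3000, centers=4, n_features=2, random_state=42)

# Regular K-means: every sample is used in each iteration to minimise WCSS (inertia_).
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Mini Batch K-means: centroids are updated from small random batches,
# trading a little accuracy for speed on large datasets.
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=10, random_state=42).fit(X)

print("K-means WCSS:", km.inertia_)
print("MiniBatchKMeans WCSS:", mbk.inertia_)
print("Silhouette (K-means):", silhouette_score(X, km.labels_))
```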

Affinity Propagation

  • At every iteration, it sends messages between pairs of samples until convergence.
  • These messages encode the suitability of one sample to be the exemplar of the other, and they are updated in response to the values from other pairs.
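A minimal sketch with scikit-learn’s AffinityPropagation is shown below; the damping value and the synthetic data are assumptions. Note that, unlike K-means, the number of clusters is not specified up front; it emerges from the message passing.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data; the number of clusters is NOT passed to the algorithm.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# damping=0.9 slows the message updates to help convergence (assumed value).
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)

# Exemplars are actual samples chosen to represent each cluster.
print("Clusters found:", len(ap.cluster_centers_indices_))
print("Exemplar indices:", ap.cluster_centers_indices_)
```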

Mean shift

  • It works by iteratively updating candidate centroids for each cluster. These candidates are then filtered to get rid of near-duplicates, and the remaining set of centroids is returned.
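A minimal sketch with scikit-learn’s MeanShift follows; the bandwidth is estimated from the data, and the quantile used for that estimate is an assumption.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Bandwidth controls the size of the region used to update each candidate centroid.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

# Near-duplicate candidate centroids have already been filtered out at this point.
print("Final centroids:\n", ms.cluster_centers_)
```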

Spectral clustering

  • It performs a low-dimensional embedding of the affinity matrix between samples, followed by clustering of the components of the eigenvectors in that low-dimensional space.
  • It works well for a small number of clusters, but is not advised when many clusters are needed.
  • It is generally used for image segmentation.
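Below is a minimal sketch using scikit-learn’s SpectralClustering on the two-moons toy dataset, where convex-cluster methods like K-means struggle; the affinity choice and n_clusters=2 are assumptions made for this shape.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-circles: a non-convex cluster shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# The affinity matrix is built from a nearest-neighbours graph (assumed choice),
# and K-means is run on the eigenvector embedding to assign the final labels.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])
```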

Hierarchical clustering

  • It is loosely reminiscent of tree-based models such as random forests in that it builds a tree structure: nested clusters are formed by merging or splitting them successively, and this hierarchy is represented as a binary tree (a dendrogram).
  • Agglomerative clustering is the bottom-up variant of hierarchical clustering: it recursively merges the pair of clusters that minimally increases a given linkage distance.
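A minimal sketch of bottom-up agglomerative clustering with scikit-learn; the Ward linkage and the n_clusters value are assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ward linkage merges the pair of clusters whose union minimally increases
# the total within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:20])
```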

DBSCAN

  • While K-means assumes that clusters are convex shaped, DBSCAN doesn’t care much about the shape but rather the density of data points. Hence the clusters formed can be of any shape.
  • It uses an eps value, which is the maximum distance between two samples for one to be considered as in the neighbourhood of the other. This is not a maximum bound on the distances of points within a cluster. It is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
  • OPTICS is a generalised version of DBSCAN that relaxes the eps requirement from a single value to a range of values.
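A minimal sketch of DBSCAN (and OPTICS) with scikit-learn; the eps, min_samples, and max_eps values are assumptions and should be tuned for your data and distance function.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_moons

# Non-convex data where density-based clustering shines.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: maximum distance for two samples to be considered neighbours (assumed value).
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Label -1 marks noise points that do not belong to any dense region.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters:", n_clusters, "| noise points:", np.sum(db.labels_ == -1))

# OPTICS relaxes eps to a range; max_eps only bounds the search (assumed value).
opt = OPTICS(min_samples=5, max_eps=0.5).fit(X)
print("OPTICS clusters:", len(set(opt.labels_)) - (1 if -1 in opt.labels_ else 0))
```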

BIRCH

  • BIRCH builds a tree called the Clustering Feature Tree (CFT), which essentially makes it a form of hierarchical clustering.
  • The BIRCH algorithm has two parameters, the threshold and the branching factor. The branching factor limits the number of subclusters in a node and the threshold limits the distance between the entering sample and the existing subclusters.
  • The algorithm can be viewed as an instance or data reduction method, since it reduces the input data to a set of subclusters which are obtained directly from the leaves of the CFT.
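A minimal sketch with scikit-learn’s Birch; the threshold, branching factor, and dataset are assumptions.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

# threshold limits the distance between an entering sample and existing subclusters;
# branching_factor limits the number of subclusters per node of the CFT.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4).fit(X)

# The subcluster centres read off the CFT leaves act as a reduced version of the data.
print("Input samples:", X.shape[0], "| subclusters kept:", birch.subcluster_centers_.shape[0])
```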

Gaussian mixture models

  • Gaussian mixtures come in four covariance types: spherical, diagonal, tied, and full.
  • It is the fastest algorithm for learning mixture models.
  • They are fitted using the expectation-maximization (EM) algorithm.
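A minimal sketch with scikit-learn’s GaussianMixture, which is fitted with EM; the n_components value and the covariance_type are assumptions (the latter can be 'spherical', 'diag', 'tied', or 'full').

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# EM alternates between soft-assigning samples to components (E-step)
# and re-estimating each Gaussian's parameters (M-step).
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

# Unlike hard clustering, each sample gets a probability per component.
print(gmm.predict_proba(X[:5]).round(3))
print("Hard labels:", gmm.predict(X[:5]))
```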
