Clustering in machine learning
Valérie Bécaert
February 25 · 5 min read

Humans have searched for patterns since long before the existence of AI. Noticing similarities in the world and uncovering new meaning and hidden knowledge is fundamental to our inquisitive nature. But now we’ve got a new and powerful helper that can handle the incredible volume of data being created in the 21st century and interpret it using techniques such as clustering in machine learning.

Clustering is a fascinating field of data science. AI-driven cluster analysis is being applied by all kinds of organizations – from businesses to scientific research teams – to glean valuable insights across every facet of life. In this article, we’ll explore how clustering works, how it originated, and how it’s being used.

Spot the similarity

Clustering refers to grouping data according to interesting or useful similarities. The concept is often credited to anthropologists Alfred Kroeber and Harold Driver. Their 1932 paper, Quantitative Expression of Cultural Relationships, examined how certain cultural traits (such as religious beliefs or styles of architecture) cross ethnicities and societies and, in some cases, appear in particular clusters of cultures. Essentially, they wanted to use pattern recognition to analyze anthropological data and gain insights that span swathes of humanity.

It wasn’t long before cluster analysis was adopted by another area of study: psychology. Psychologist Joseph Zubin referred to it as “a technique for measuring like-mindedness” in his 1938 paper of the same name, describing a method of subdividing people into groups who think alike in terms of certain social criteria. Clustering was quickly adopted by contemporaries such as behavioral psychologist Robert Tryon and pioneering personality theorist Raymond Cattell, and from there cluster studies spread throughout the rest of the sciences.

Types of clustering algorithms

Today’s machine learning systems use a range of methods to break data down into groups and subgroups in the search for meaning. Here are a few of the key types of clustering algorithms you need to know, with short code sketches after the list:

  • Distance between datapoints. When datapoints are plotted in a space (imagine a scatter graph), the distances between them are calculated, and the areas where datapoints sit closest together are defined as groups. This is a relatively simple method, but it may not be suitable for big data analysis because it requires every datapoint to be compared with every other one – a large amount of processing. This method is also known as hierarchical or connectivity-based clustering, because it builds a hierarchy from the connections between datapoints.
  • Density of points within an area. These clusters are defined by areas of the space that are densely populated with datapoints, separated from one another by sparser regions. Datapoints outside the clusters are disregarded as “noise” and therefore aren’t accounted for in the analysis. The algorithm works on the basis that, within a defined radius, there must be a certain density of datapoints for a region to be considered a cluster. Points close enough to the cluster to meet its criteria are accepted into it, and the cluster stops growing once the distance to the next point is too great for it to be accepted. The cluster is then set and can be analyzed.
  • Probability distribution-based clustering. The algorithm works out the probability that each datapoint belongs to a cluster using a chosen distribution model. It uses centroids to define each cluster’s center point and categorizes the surrounding data accordingly: the further a datapoint is from a cluster’s center, the lower the probability that it belongs to that cluster.
  • Centroid or k-means clustering. This method randomly selects k datapoints as the initial centroids, and the surrounding datapoints closest to each centroid are assigned to its cluster. The mean (average) of all the datapoints within each cluster is then computed and the centroid repositioned at that mean. The process repeats until the centroids stop moving and the clusters are firmly defined. From then on, any new data is assigned to these clusters and the centroids are recalculated. The number of centroids used is referred to as “k” – hence “k-means.” The second sketch below walks through these steps in code.
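
To make these four approaches concrete, here is a minimal sketch that runs each of them on the same synthetic dataset using scikit-learn. The library, the parameter values and the toy data are assumptions chosen for illustration, not a recipe tied to any particular real-world dataset.

```python
# Compare the four clustering families described above on a toy 2-D dataset.
# Assumes scikit-learn is installed; parameters are illustrative, not tuned.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

# 300 points scattered around 3 hidden group centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Connectivity-based (hierarchical): repeatedly merges the closest points/groups.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based: dense regions become clusters; outliers are labeled -1 ("noise").
density_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Distribution-based: fits a mixture of Gaussians and assigns each point to the
# component it most probably belongs to.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

# Centroid-based (k-means): iteratively moves k centroids to the mean of their points.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(set(hier_labels), set(density_labels), set(gmm_labels), set(kmeans_labels))
```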
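
And because the k-means bullet describes the algorithm step by step, here is a from-scratch version of that loop in plain NumPy. The function and variable names are hypothetical, and the sketch skips edge cases such as clusters that end up empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Illustrative k-means loop: assign points to the nearest centroid,
    move each centroid to the mean of its points, repeat until stable."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select k datapoints as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it.
        #    (For brevity, this sketch does not handle empty clusters.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: group 200 random 2-D points into k = 3 clusters.
points = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(points, k=3)
```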

Unlocking business insights

With today’s machine learning technology and processing power, cluster analysis can be applied to immense volumes of data with a great degree of sophistication.

Clustering techniques have long been used for market insights and strategy. Applying them to the rich customer and transaction data available to retailers results in deep, nuanced and highly effective market segmentation. Recommendation engines that constantly improve can pinpoint the products and services consumers want and need, based on patterns in their circumstances and spending.

Masses of possibilities

The more advanced clustering in machine learning gets, the greater the insights it can deliver – for commerce, science and society. AI systems are even detecting clusters of cancer symptoms, so that treatment can begin as soon as possible.

The study of clusters began as a way to understand humanity, and it continues to help us comprehend our world in ways previously unimagined.