Hierarchical clustering is a well-known method in machine learning and data analysis. It is used to group related data points into clusters according to their features. The idea behind hierarchical clustering is to build an ordered structure of clusters, in which clusters at lower levels are merged to form larger clusters at higher levels. The procedure continues until all the data points are placed in a single cluster or a stopping criterion is met.
The procedure behind hierarchical clustering can be broken down into a few steps:
- Initialization: Each data point starts out as its own cluster, so a dataset with n data points begins with n clusters.
- Distance calculation: The next step is to compute the distance, or degree of dissimilarity, between every pair of clusters. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity, depending on the type of data.
- Merging: Once the distances are computed, the two closest clusters are merged into a single cluster. Which clusters are merged depends on the linkage criterion, for example single linkage, complete linkage, or average linkage.
- Hierarchy construction: As clusters are merged, a dendrogram (hierarchy) is built to record the sequence of merges. The dendrogram shows the relationships between clusters at different levels of the hierarchy: the vertical axis represents the dissimilarity or distance at which clusters are merged, and the horizontal axis lists the individual data points or clusters.
- Repetition: Steps 2 to 4 are repeated until all the data points are merged into a single cluster or a stopping criterion is reached. The stopping criterion can be a predefined number of clusters or a distance threshold. A short code sketch of the whole procedure follows this list.
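As a minimal sketch of these steps, the following Python snippet uses SciPy's `linkage` and `dendrogram` functions; the dataset, labels, and choice of average linkage with Euclidean distance are illustrative assumptions, not part of the original text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Step 1: a small illustrative dataset; each row is one data point
# (and therefore its own cluster at the start).
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Steps 2-4: linkage() repeatedly computes pairwise distances (Euclidean here)
# and merges the two closest clusters, using average linkage as the criterion.
Z = linkage(X, method="average", metric="euclidean")

# Step 5 (hierarchy construction): plot the dendrogram that records every merge.
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.ylabel("Distance at which clusters are merged")
plt.show()
```

Swapping `method` for `"single"` or `"complete"` changes the linkage criterion and, in general, the shape of the resulting dendrogram.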
There are two main approaches to hierarchical clustering:
Agglomerative Hierarchical Clustering:
- This approach starts with every data point in its own cluster and iteratively merges the closest pair of clusters until a single cluster containing all the data points is formed.
- At each step, the algorithm chooses which clusters to merge according to the linkage criteria mentioned above (a short scikit-learn example follows this list).
- Agglomerative hierarchical clustering can be more expensive to compute than divisive hierarchical clustering, especially on large datasets, because it must calculate distances between pairs of clusters at every step.
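A hedged example of agglomerative clustering using scikit-learn's `AgglomerativeClustering`; the data, the choice of complete linkage, and the cut at two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# linkage="complete" uses the maximum pairwise distance between clusters as the
# merge criterion; "single" and "average" are the other options discussed above.
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(X)
print(labels)  # one cluster index per data point
```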
Divisive Hierarchical Clustering:
- This approach works in the opposite direction: it begins with all the data in a single cluster and recursively splits clusters into smaller ones until each cluster contains just one data point (see the sketch after this list).
- Divisive hierarchical clustering can be computationally less costly than agglomerative clustering, since it does not have to recompute distances between every pair of clusters at each step.
- However, finding the optimal split at each step is difficult, and divisive clustering may not always yield meaningful results.
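Common Python libraries do not offer a one-call divisive hierarchical clustering routine, so the sketch below approximates the top-down idea by recursively bisecting the largest remaining cluster with 2-means; the helper name `divisive_clustering`, the split heuristic, and the data are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    """Split the data top-down until n_clusters groups remain (simplified sketch)."""
    clusters = [np.arange(len(X))]  # start: every point in one big cluster
    while len(clusters) < n_clusters:
        # pick the largest remaining cluster and split it into two with 2-means
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
for i, members in enumerate(divisive_clustering(X, 3)):
    print(f"cluster {i}: points {members.tolist()}")
```

A full divisive algorithm would examine every possible split rather than relying on a 2-means heuristic, which is exactly why finding the optimal division is hard in practice.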
Hierarchical clustering offers several benefits:
- No preset number of clusters: Unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance, which makes it well suited to exploratory data analysis (see the sketch after this list).
- Interpretability: The hierarchical structure of the clusters, represented by the dendrogram, reveals the relationships between clusters and data points, making the results easier to interpret.
- Flexibility: Hierarchical clustering can accommodate many different distance metrics and data types, which makes it applicable to a wide range of problems.
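One way the "no preset number of clusters" benefit shows up in practice is cutting the dendrogram at a distance threshold instead of fixing k in advance; the snippet below sketches this with SciPy's `fcluster`, where the threshold of 2.0 is an illustrative assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(X, method="average")

# Merges above the distance threshold are ignored, so the number of clusters
# is determined by the data rather than specified up front.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # one cluster index per data point
```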
Hierarchical clustering also has some drawbacks:
- Computational complexity: Agglomerative hierarchical clustering is costly to compute, particularly on large datasets, because it involves calculating distances between pairs of clusters at every step.
- Sensitivity to outliers and noise: Hierarchical clustering is sensitive to noise and outliers, which can distort the structure of the dendrogram and lead to suboptimal clustering results.
- Difficulty with large datasets: Hierarchical clustering may not scale to very large datasets because of its computational cost and memory requirements.
In summary, hierarchical clustering is an effective method for finding natural groupings in data and illustrating the hierarchical relationships between clusters. By iteratively merging or splitting clusters based on their similarity, it can provide useful insight into the structure and organization of a dataset. Its effectiveness, however, depends on the choice of distance measure and linkage criterion, and on the ability to interpret the dendrogram in the context of the specific problem.