module HClust

Overview

The HClust module provides methods for fast hierarchical agglomerative clustering featuring efficient linkage algorithms.

Cluster analysis or clustering arranges a set of objects into distinct groups or clusters such that the elements within a cluster are more similar to each other than to those in other clusters based on a given criterion. The latter is defined as a measure of dissimilarity between the objects, and it is usually encoded in a DistanceMatrix.
Hierarchical agglomerative clustering builds a hierarchy of clusters, where each object or element starts in its own cluster, and the clusters are then progressively merged to form a hierarchy among them. The dissimilarities between a newly-created cluster and all other clusters are updated after each merging step. The dissimilarity between two clusters is dictated by the chosen linkage criterion or rule. The obtained hierarchy is often encoded as a Dendrogram, which can be cut at a given dissimilarity value to obtain flat clusters such that the elements of a flat cluster have a cophenetic distance no greater than the given value (see Dendrogram#flatten).
The standard algorithm has a time complexity of Θ(N³), and is implemented by the .primitive method. However, optimal algorithms are provided based on the work of Daniel Müllner [1], which describes several optimizations over the standard algorithm (implemented by the .generic method) as well as fast methods for special cases: the minimum spanning tree (MST) method (.mst) for single linkage, and the nearest-neighbor-chain (NN-chain) method (.nn_chain) for average, complete, Ward's, and weighted linkage. Thus, the best-case complexity is reduced to Θ(N²), and in practice the runtime of the general case is close to this bound.
The current implementation is heavily based on Müllner's own implementation found in the fastcluster C++ library, and parts of the implementation were also inspired by the kodama Rust crate (code for updating distances) and SciPy Python library (code for generating flat clusters).
The most relevant types and methods for practical usage are the following:

- .cluster performs hierarchical clustering of a set of elements using the pairwise dissimilarities returned by the given block.
- .linkage performs hierarchical clustering based on a pairwise DistanceMatrix, returning a Dendrogram as output.
- Rule is an enum that determines the linkage criterion or rule.
- Dendrogram represents a step-wise dendrogram that encodes the hierarchy of clusters obtained from hierarchical clustering.
- Dendrogram#flatten returns a list of flat clusters.
The available linkage rules are:

- Rule::Average or UPGMA method.
- Rule::Centroid or UPGMC method.
- Rule::Complete or farthest neighbor method.
- Rule::Median or WPGMC method.
- Rule::Single or nearest neighbor method.
- Rule::Ward or Ward's minimum variance method.
- Rule::Weighted or WPGMA method.

See the Rule documentation for details.
Clustering example
Let's first define a list of random coordinates in 3D space as an example case:
coords = [
[-0.30818828, 2.70462841, 1.84344886],
[2.9666203, -1.39874721, 4.76223947],
[3.21737027, 4.09489028, -4.60403434],
[-3.51140292, -0.83953645, 2.31887739],
[2.08457843, 4.24960773, -3.91378835],
[2.88992367, -0.97659082, 0.75464131],
[0.43808545, 3.70042294, 4.99126146],
[-1.71676206, 4.93399583, 0.27392482],
[1.12130963, -1.09646418, 1.45833231],
[-3.45524705, 0.92812111, 0.15155981],
]
labels = (0...coords.size).to_a # for demonstration purposes
We'd like to group them by their proximity to each other, defined by the Euclidean distance:

def euclidean(u, v)
  Math.sqrt((0...u.size).sum { |i| (u[i] - v[i])**2 })
end
Let's say we want to split the coordinates into groups with a distance of no more than 4. The easiest way is to use the .cluster convenience method:

clusters = HClust.cluster(labels, 4) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 3
clusters # => [[0, 3, 6, 7, 9], [1, 5, 8], [2, 4]]
The method receives the elements to be clustered, a distance cutoff, and a block that receives two of the elements and must return the dissimilarity between them. We observe that the coordinates can be grouped into 3 distinct clusters containing 5, 3, and 2 elements, respectively. The order of the clusters is arbitrary and depends on the order of the elements.
Alternatively, one can set the maximum number of clusters to be generated instead of a distance cutoff using the named argument into:
clusters = HClust.cluster(labels, into: 2) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 2
clusters # => [[0, 1, 3, 5, 6, 7, 8, 9], [2, 4]]
As stated above, the linkage rule dictates how the dissimilarities are updated upon merging two clusters during the clustering procedure. Therefore, the clustering output will depend on the selected rule. In turn, the choice of the linkage rule depends both on the problem domain and on performance requirements. By default, single linkage is used throughout the code, but the rule can be given as the third (optional) argument to the cluster methods if required.
clusters = HClust.cluster(labels, 2, :centroid) { |i, j|
euclidean(coords[i], coords[j])
}
clusters.size # => 5
clusters # => [[0, 7], [1, 5, 8], [2, 4], [3, 9], [6]]
Note the different number and composition of clusters obtained with the centroid linkage compared to the previous result using the single linkage.
Advanced usage

Using the .cluster methods is enough in most cases, though obtaining the dendrogram can be useful for visualization purposes or for testing different clustering arguments without recomputing the dissimilarities.
Under the hood, the .cluster methods construct a DistanceMatrix with the given block, invoke the .linkage method, and then call Dendrogram#flatten on the obtained dendrogram with the given argument. The latter returns an array of clusters, each containing a list of indexes that can be used to fetch the original elements. The code equivalent to the example above would be:
dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
dendrogram = HClust.linkage(dism, :single)
clusters = dendrogram.flatten(4)
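Since the dendrogram encodes the whole hierarchy, it can be cut again at a different value without recomputing the dissimilarities. A small sketch (the expected cluster counts follow from the merge steps printed below):

dendrogram.flatten(4).size # => 3 (the same clusters as above)
dendrogram.flatten(5).size # => 2 (same grouping as with into: 2)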
The dendrogram represents the hierarchy as a series of merge steps, each containing the two merged clusters and the computed dissimilarity. For instance, let's see what the dendrogram looks like:
pp dendrogram.steps.map { |s| [s.clusters[0], s.clusters[1], s.distance, 0.0] }
will print:
# cluster1, cluster2, dissimilarity, size (unused)
[[2, 4, 1.3355127737375514, 0.0],
[5, 8, 1.9072352420201895, 0.0],
[3, 9, 2.7973259058854163, 0.0],
[0, 7, 3.068805125647255, 0.0],
[6, 13, 3.384856775520168, 0.0],
[1, 11, 3.796359969884454, 0.0],
[12, 14, 3.9902939298123274, 0.0],
[15, 16, 4.079225895869359, 0.0],
[10, 17, 5.696974476555648, 0.0]]
In the output, a cluster index equal to or greater than the number of elements (10 here) refers to the cluster created in a previous merge step; for instance, index 13 denotes the cluster [0, 7] created at step 3 (counting from zero). The output can be copied into a Python terminal and visualized using the dendrogram function in SciPy or similar software. It would look something like this:
| ________________
5 | | |
| | |
4 | | _______|_______
| | __|__ ____|___
3 | | | | | __|__
| | | | _|_ | _|_
2 | | | _|_ | | | | |
| | | | | | | | | |
1 | _|_ | | | | | | | |
| | | | | | | | | | |
0 | 2 4 1 5 8 3 9 6 0 7
Using this graph, one can deduce the optimal cutoff for generating flat clusters, where a cutoff of 4 would indeed produce the three clusters shown above.
Defined in:

hclust.cr
hclust/cluster.cr
hclust/dendrogram.cr
hclust/version.cr
Constant Summary

- VERSION = {{ (`shards version "/home/runner/work/hclust/hclust/src/hclust"`).chomp.stringify }}
Class Method Summary

- .centroids(elements : Indexable(T), count : Int, rule : Rule = :single, & : T, T -> Float64) : Array(T) forall T
  Returns the centroids of the count clusters or fewer of elements using the linkage rule rule based on the distances computed by the given block.

- .centroids(elements : Indexable(T), *, cutoff : Number, rule : Rule = :single, & : T, T -> Float64) : Array(T) forall T
  Returns the centroids of the clusters of elements using the linkage rule rule based on the distances computed by the given block.

- .cluster(elements : Indexable(T), count : Int, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T
  Clusters elements into count clusters or fewer using the linkage rule rule based on the distances computed by the given block.

- .cluster(elements : Indexable(T), *, cutoff : Number, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T
  Clusters elements using the linkage rule rule based on the distances computed by the given block.

- .generic(dism : DistanceMatrix, rule : Rule) : Dendrogram
  Performs hierarchical clustering based on the distances stored in dism using a fast generic linkage algorithm with the given linkage rule.

- .linkage(dism : DistanceMatrix, rule : Rule, reuse : Bool = false) : Dendrogram
  Returns the hierarchical clustering based on the pairwise distances dism using the linkage rule rule.

- .mst(dism : DistanceMatrix) : Dendrogram
  Performs hierarchical clustering based on the distances stored in dism using the minimum spanning tree (MST) algorithm.

- .nn_chain(dism : DistanceMatrix, rule : ChainRule) : Dendrogram
  Performs hierarchical clustering based on the distances stored in dism using the nearest-neighbor-chain (NN-chain) algorithm with the given linkage rule.

- .primitive(dism : DistanceMatrix, rule : Rule) : Dendrogram
  Performs hierarchical clustering based on the distances stored in dism using the classic algorithm with the given linkage rule.
Class Method Detail

.centroids(elements : Indexable(T), count : Int, rule : Rule = :single, & : T, T -> Float64) : Array(T) forall T

Returns the centroids of the count clusters or fewer of elements using the linkage rule rule based on the distances computed by the given block.

The centroid of a cluster is defined as the element with the smallest average distance to all other members of the cluster.

.centroids(elements : Indexable(T), *, cutoff : Number, rule : Rule = :single, & : T, T -> Float64) : Array(T) forall T

Returns the centroids of the clusters of elements using the linkage rule rule based on the distances computed by the given block. The clusters are generated such that the cophenetic distance between any two elements in a cluster is less than or equal to cutoff.

The centroid of a cluster is defined as the element with the smallest average distance to all other members of the cluster.
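A usage sketch of the cutoff overload, reusing labels, coords, and the euclidean helper from the clustering example above (the expected count follows from the three clusters found there):

# Centroid (most central element) of each flat cluster obtained with a
# cophenetic distance cutoff of 4 and the default single linkage.
centroids = HClust.centroids(labels, cutoff: 4) { |i, j| euclidean(coords[i], coords[j]) }
centroids.size # => 3 (one centroid per cluster)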
.cluster(elements : Indexable(T), count : Int, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T

Clusters elements into count clusters or fewer using the linkage rule rule based on the distances computed by the given block.

.cluster(elements : Indexable(T), *, cutoff : Number, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T

Clusters elements using the linkage rule rule based on the distances computed by the given block. The clusters are generated such that the cophenetic distance between any two elements in a cluster is less than or equal to cutoff.
.generic(dism : DistanceMatrix, rule : Rule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using a fast generic linkage algorithm with the given linkage rule.

The so-called generic algorithm published by Müllner, D. [1] includes several optimizations over the classic hierarchical clustering algorithm, reducing the best-case complexity from Θ(N³) to Θ(N²). In practice, it is considerably faster than the standard method, mainly because the algorithm keeps track of the nearest neighbors of clusters in a priority queue to speed up the repeated minimum searches.

This algorithm can deal with inversions in the dendrogram, so it can be used with any linkage rule, including Rule::Centroid and Rule::Median.

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.1 of Müllner's article [1].

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.
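A minimal sketch of calling it directly, reusing coords and euclidean from the example above (a fresh matrix is built since the lower-level methods may modify it in place):

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
# The generic algorithm handles rules that may produce inversions,
# such as centroid and median (symbols autocast to the Rule enum).
dendrogram = HClust.generic(dism, :centroid)
dendrogram.steps.size # => 9 (N - 1 merge steps for 10 elements)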
.linkage(dism : DistanceMatrix, rule : Rule, reuse : Bool = false) : Dendrogram

Returns the hierarchical clustering based on the pairwise distances dism using the linkage rule rule.
This method simply selects and invokes the optimal algorithm based on the given linkage rule as follows:
- The minimum spanning tree (MST) algorithm for the Rule::Single linkage rule (see .mst).
- The nearest-neighbor-chain (NN-chain) algorithm for the Rule::Complete, Rule::Average, Rule::Weighted, and Rule::Ward linkage rules (see .nn_chain).
- The generic algorithm for the Rule::Centroid and Rule::Median linkage rules (see .generic).
If reuse is true, the distance matrix dism will be forwarded directly to the underlying method and potentially modified. If reuse is false, a copy will be created first and then forwarded. Passing reuse: true avoids a potentially large memory allocation when the distance matrix will not be used after clustering.
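A sketch of both modes, again reusing coords and euclidean from the example above:

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }

# Default (reuse: false): an internal copy is made, so dism remains valid.
dendrogram = HClust.linkage(dism, :single)

# reuse: true skips the copy; dism may be modified and should be
# treated as consumed afterwards.
dendrogram = HClust.linkage(dism, :single, reuse: true)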
.mst(dism : DistanceMatrix) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the minimum spanning tree (MST) algorithm.

The MST algorithm keeps track of the distance to the nearest neighbor for each cluster after every merge step, which leads to a significant speed-up as obtaining the next pair of nearest clusters is very efficient. By definition, this algorithm can only be used with the Rule::Single linkage rule.

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.3 of Müllner's article [1], which includes several optimizations over the classic implementation.

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.
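A minimal sketch (no rule argument is needed since the method implies single linkage; a fresh matrix is built as above):

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
# Produces the same hierarchy as .linkage(dism, :single).
dendrogram = HClust.mst(dism)
dendrogram.steps.size # => 9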
.nn_chain(dism : DistanceMatrix, rule : ChainRule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the nearest-neighbor-chain (NN-chain) algorithm with the given linkage rule.

The NN-chain algorithm follows paths in the nearest neighbor graph of the clusters to find a pair of clusters that are nearest neighbors of each other, which are then merged into a new cluster. The algorithm uses a stack data structure (Deque) to store the paths, which can lead to a speed-up by efficiently re-using parts of the existing paths.

By definition, the NN-chain algorithm can only be used with the following linkage rules: Rule::Single, Rule::Complete, Rule::Average, Rule::Weighted, and Rule::Ward. Consequently, this method accepts a ChainRule enum (not Rule), which only includes these rules to ensure safety at compile time.

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.2 of Müllner's article [1].

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.
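A minimal sketch (the symbol autocasts to ChainRule, so an unsupported rule such as :centroid would be rejected at compile time):

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
# NN-chain supports single, complete, average, weighted, and Ward linkage.
dendrogram = HClust.nn_chain(dism, :ward)
dendrogram.steps.size # => 9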
.primitive(dism : DistanceMatrix, rule : Rule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the classic algorithm with the given linkage rule.

The classic or primitive algorithm iteratively finds the pair of clusters that are nearest neighbors of each other and merges them into a new cluster. Then, the distances stored in dism are updated by computing the distances to the newly-created cluster according to the linkage rule. The procedure is repeated N - 1 times, where N is the number of elements or observations. Since all pairwise distances are searched in each iteration, the time complexity of this algorithm is Θ(N³).

This algorithm can deal with inversions in the dendrogram, so it can be used with any linkage rule, including Rule::Centroid and Rule::Median.

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 2.4 of Müllner's article [1].

WARNING This method is painfully slow and should not be used in production. It serves only as a reference for testing the other methods. Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.
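A sketch contrasting the reference implementation with an optimized one (fresh matrices are built for each call since, as described above, the distances are updated in place):

dism1 = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
dism2 = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }

slow = HClust.primitive(dism1, :single) # Θ(N³), for testing only
fast = HClust.mst(dism2)                # optimized algorithm for single linkage
slow.steps.size == fast.steps.size      # => true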