Clustering

Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can be referred to as a clustering. In this context, different clustering methods may generate different clusterings on the same data set. The partitioning is not performed by humans, but by the clustering algorithm. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data.

Applications

Cluster analysis has been widely used in many applications such as business intelligence, image pattern recognition, web search, biology, and security. In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics. In image recognition, clustering can be used to discover clusters or “subclasses” in handwritten character recognition systems. In web search, clustering can be used to organize the search results into groups and present the results in a concise and easily accessible way.

Clustering Algorithms

Clustering algorithms can be categorized based on their cluster model. The following overview lists only the two most prominent categories of clustering algorithms.

Partitioning clustering

Given a set of n objects, partitioning clustering constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group contains at least one object. In other words, partitioning clustering conducts a one-level partitioning of the data set.

Hierarchical clustering

While partitioning clustering meets the basic clustering requirement of organizing a set of objects into a number of exclusive groups, in some situations we may want to partition our data into groups at different levels, such as in a hierarchy. A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters, i.e., it creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all the groups are merged into one (the topmost level of the hierarchy), or a termination condition holds.
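As a sketch of the partitioning approach, the classic k-means algorithm alternates between assigning each object to its nearest centroid and recomputing each centroid as the mean of its group. This is a minimal pure-Python version; the initial centroids are passed in explicitly to keep it deterministic.

```python
def kmeans(points, init_centroids, iters=20):
    """Partition `points` into k groups, one per initial centroid."""
    centroids = list(init_centroids)
    k = len(centroids)
    for _ in range(iters):
        # Assignment step: each point joins the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            groups[i].append(p)
        # Update step: move each centroid to the mean of its group
        # (keep the old centroid if a group ends up empty).
        centroids = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Two well-separated groups of 2-D points; k = 2.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, groups = kmeans(points, [points[0], points[-1]])
```

After convergence, each group holds one of the two natural clusters and each centroid sits at its group's mean.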

Agglomerative vs Divisive approach

The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or a termination condition holds.

Representing data objects in the form of a hierarchy is useful for data summarization and visualization. For example, as the manager of human resources at ABC Electronics, one may organize the employees into major groups such as managers, secretaries, and staff. One can further divide these groups into smaller subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers, officers, and trainees. All these groups form a hierarchy. We can easily summarize or characterize the data that are organized into a hierarchy, which can be used to find, say, the average salary of managers and of officers.

Agglomerative hierarchical clustering treats each data point as a singleton cluster, and then successively merges clusters until all points have been merged into a single remaining cluster. A hierarchical clustering is often represented as a dendrogram, a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

Figure: Hierarchical clustering example.

Let us apply hierarchical clustering to the data set shown in Figure (a). Figure (b) shows the dendrogram using single linkage. Figure (c) shows the case using complete linkage, where the edges between clusters {A, B, J, H} and {C, D, G, F, E} are omitted for ease of presentation. This example shows that by using single linkage we can find hierarchical clusters defined by local proximity, whereas complete linkage tends to find clusters favoring global closeness.

In single-link (or single linkage) hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance (or: the two clusters with the smallest minimum pairwise distance). In complete-link (or complete linkage) hierarchical clustering, we merge in each step the two clusters whose merger has the smallest diameter (or: the two clusters with the smallest maximum pairwise distance).

Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points. Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will neither undo what was done previously, nor perform object swapping between clusters. Thus, merge or split decisions, if not well chosen, may lead to low-quality clusters. Moreover, these methods do not scale well, because each merge or split decision needs to examine and evaluate many objects or clusters.

A promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase (or multiphase) clustering. One such algorithm is BIRCH. It begins by partitioning objects hierarchically using tree structures, where the leaf or low-level nonleaf nodes can be viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters.
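The bottom-up merging under both linkage criteria can be sketched in a few lines of pure Python (squared Euclidean distances; merging continues until k clusters remain):

```python
def agglomerate(points, linkage, k):
    """Bottom-up clustering: start with singleton clusters, repeatedly
    merge the pair of clusters with the smallest linkage distance until
    k clusters remain.  linkage='single' uses the minimum pairwise
    distance between clusters, 'complete' the maximum."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def cluster_dist(c1, c2):
        d = [dist2(p, q) for p in c1 for q in c2]
        return min(d) if linkage == 'single' else max(d)

    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]],
                                               clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerate([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
                        (10.0, 0.0), (11.0, 0.0)], 'single', k=2)
```

On this tiny chain-shaped data set both linkages agree; on data with elongated clusters, single linkage follows chains of locally close points while complete linkage prefers compact, small-diameter groups.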

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

BIRCH is designed for clustering a large amount of numeric data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes the two difficulties of agglomerative clustering methods: (1) scalability and (2) the inability to undo what was done in a previous step.

Overview

Figure: BIRCH Overview.

The figure above presents an overview of BIRCH. It consists of four phases: (1) Loading, (2) Optional Condensing, (3) Global Clustering, and (4) Optional Refining.

Phase 1: Loading

The main task of Phase 1 is to scan the data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk. This CF-tree tries to reflect the clustering information of the data set in as much detail as possible subject to the memory limits. With crowded data points grouped into subclusters, and sparse data points removed as outliers, this phase creates an in-memory summary of the data.

BIRCH uses the notion of a clustering feature to summarize a cluster, and a clustering feature tree (CF-tree) to represent a cluster hierarchy. These structures help the clustering method achieve good speed and scalability in large or even streaming databases, and also make it effective for incremental and dynamic clustering of incoming objects.

Consider a cluster of n d-dimensional data objects or points. The clustering feature (CF) of the cluster is a three-dimensional vector summarizing information about the cluster. It is defined as

CF = {n, LS, SS}

where LS is the linear sum of the n points (i.e., LS = x1 + x2 + ... + xn) and SS is the square sum of the points (i.e., SS = x1² + x2² + ... + xn², with xi² denoting the squared norm of xi).

A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many useful statistics of a cluster.
For example, the cluster’s centroid x0, radius R, and diameter D are

x0 = (x1 + ... + xn) / n = LS / n

R = sqrt( ( ||x1 − x0||² + ... + ||xn − x0||² ) / n ) = sqrt( (n·SS − LS²) / n² )

D = sqrt( ( Σi Σj ||xi − xj||² ) / (n(n − 1)) ) = sqrt( (2n·SS − 2·LS²) / (n(n − 1)) )

where LS² denotes the squared norm LS · LS. Here, R is the average (root-mean-square) distance from member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.

Summarizing a cluster using the clustering feature avoids storing the detailed information about individual objects or points. Instead, we only need a constant amount of space to store the clustering feature. This is the key to BIRCH’s efficiency in space. Moreover, clustering features are additive. That is, for two disjoint clusters C1 and C2, with the clustering features CF1 = {n1, LS1, SS1} and CF2 = {n2, LS2, SS2}, respectively, the clustering feature for the cluster formed by merging C1 and C2 is simply

CF1 + CF2 = {n1 + n2, LS1 + LS2, SS1 + SS2}.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
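A minimal sketch of these definitions and of CF additivity in pure Python; `radius` and `diameter` use the closed forms above, so neither ever touches the individual points:

```python
from math import sqrt

def cf_of(points):
    """Clustering feature {n, LS, SS} of a set of d-dimensional points."""
    n, d = len(points), len(points[0])
    ls = tuple(sum(p[j] for p in points) for j in range(d))
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    """Additivity: the CF of a merged cluster is the componentwise sum."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)                  # x0 = LS / n

def radius(cf):
    n, ls, ss = cf
    ls2 = sum(x * x for x in ls)                     # squared norm of LS
    return sqrt(max(0.0, (n * ss - ls2) / (n * n)))  # R

def diameter(cf):
    n, ls, ss = cf
    ls2 = sum(x * x for x in ls)
    return sqrt(max(0.0, (2 * n * ss - 2 * ls2) / (n * (n - 1))))  # D

cf = cf_of([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
```

For these three points, CF = {3, (6, 6), 28}, the centroid is (2, 2), and D = 2; merging the CFs of any disjoint split of the points reproduces the same CF.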

By definition, a nonleaf node in a tree has descendants or “children.” The nonleaf nodes store sums of the CFs of their children, and thus summarize clustering information about their children. A CF-tree has two parameters: branching factor, B, and threshold, T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree. These two parameters implicitly control the resulting tree’s size.

Phase 1

For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus, the method is incremental. An object is inserted into the closest leaf entry (subcluster). If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold value, then the leaf node and possibly other nodes are split. After the insertion of the new object, information about the object is passed toward the root of the tree. The size of the CF-tree can be changed by modifying the threshold. If the size of the memory needed for storing the CF-tree is larger than the size of the main memory, then a larger threshold value can be specified and the CF-tree is rebuilt. The rebuild process is performed by building a new tree from the leaf nodes of the old tree. Thus, the tree can be rebuilt without rereading all the objects or points. This is similar to the insertion and node splits in the construction of B+-trees.

Therefore, for building the tree, data has to be read just once. Some heuristics and methods have been introduced to deal with outliers and improve the quality of CF-trees by additional scans of the data. Once the CF-tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used with the CF-tree in Phase 3.
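The leaf-level insertion logic of Phase 1 can be sketched with a flat list of CF entries. This is a toy stand-in: a real CF-tree also has nonleaf levels, a branching factor B, node splits, and threshold-driven rebuilds, all omitted here.

```python
from math import sqrt

def insert_point(subclusters, p, T):
    """Insert point p into the leaf entry (CF subcluster) whose centroid
    is nearest, provided the merged diameter stays within threshold T;
    otherwise start a new entry.  Each entry is a CF triple (n, LS, SS)."""
    def absorb(cf):
        # CF of the subcluster after adding p (CF additivity).
        n, ls, ss = cf
        return (n + 1, tuple(a + b for a, b in zip(ls, p)),
                ss + sum(x * x for x in p))

    def diameter(cf):
        n, ls, ss = cf
        if n < 2:
            return 0.0
        ls2 = sum(x * x for x in ls)
        return sqrt(max(0.0, (2 * n * ss - 2 * ls2) / (n * (n - 1))))

    def dist2_to_centroid(cf):
        n, ls, _ = cf
        return sum((x / n - a) ** 2 for x, a in zip(ls, p))

    if subclusters:
        i = min(range(len(subclusters)),
                key=lambda i: dist2_to_centroid(subclusters[i]))
        if diameter(absorb(subclusters[i])) <= T:
            subclusters[i] = absorb(subclusters[i])
            return
    subclusters.append((1, tuple(p), sum(x * x for x in p)))

entries = []
for p in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]:
    insert_point(entries, p, T=1.0)
```

The two nearby points are absorbed into one entry, while the distant point starts a new entry; this is how crowded points become subclusters and sparse points stay isolated.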

Phase 2: Condensing

Phase 2 is an optional phase. It serves as a cushion between Phase 1 and Phase 3 and bridges this gap: we scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree, while removing more outliers and grouping more crowded subclusters into larger ones.

Phase 3: Global Clustering

Once all the clustering information is loaded into the in-memory CF-tree, we can use an existing global or semi-global algorithm in Phase 3 to cluster all the leaf entries across the boundaries of different nodes, which removes sparse clusters as outliers and groups dense clusters into larger ones. After this phase, we obtain a set of clusters that captures the major distribution patterns in the data. However, minor and localized inaccuracies might exist.
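One way to sketch Phase 3 is to run a weighted k-means over the leaf-entry centroids, each weighted by the number of points it summarizes. The `(n, centroid)` input shape is an assumption made for illustration, not the actual CF-tree leaf layout.

```python
def global_cluster(microclusters, k, iters=10):
    """Phase 3 sketch: weighted k-means over microcluster centroids.
    `microclusters` is a list of (n, centroid) pairs; each centroid is
    weighted by the number of points n it summarizes."""
    seeds = [c for _, c in microclusters[:k]]
    for _ in range(iters):
        # Assign each microcluster to its nearest seed.
        groups = [[] for _ in range(k)]
        for n, c in microclusters:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(c, seeds[i])))
            groups[i].append((n, c))
        # Recompute each seed as the weighted mean of its group.
        for i, g in enumerate(groups):
            total = sum(n for n, _ in g)
            if total:
                d = len(g[0][1])
                seeds[i] = tuple(sum(n * c[j] for n, c in g) / total
                                 for j in range(d))
    return seeds

micro = [(3, (1.0, 1.0)), (2, (1.2, 0.9)), (4, (8.0, 8.0)), (1, (7.5, 8.2))]
seeds = global_cluster(micro, k=2)
```

Because the algorithm runs over a handful of summaries rather than the raw points, this macroclustering step is cheap regardless of the original data set size.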

Phase 4: Optional Refining

Phase 4 is optional and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. It uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes each data point to its closest seed to obtain a set of new clusters. Not only does this allow points belonging to a cluster to migrate, but it also ensures that all copies of a given data point go to the same cluster. This phase can be extended with additional passes if desired by the user, and it has been proved to converge to a minimum.
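A single redistribution pass of Phase 4 can be sketched as follows: each raw point moves to its nearest seed, and the seeds are then recomputed from the new clusters.

```python
def refine(points, seeds):
    """Phase 4 sketch: one redistribution pass.  Each data point moves to
    its nearest seed (a Phase 3 centroid); the refreshed clusters and
    their centroids are returned."""
    clusters = [[] for _ in seeds]
    for p in points:
        i = min(range(len(seeds)),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, seeds[i])))
        clusters[i].append(p)
    # Recompute each seed as the mean of its cluster
    # (keep the old seed if a cluster ends up empty).
    new_seeds = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else s
                 for c, s in zip(clusters, seeds)]
    return clusters, new_seeds

pts = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
clusters, new_seeds = refine(pts, [(0.0, 0.0), (10.0, 10.0)])
```

Repeating this pass to a fixed point is exactly the k-means iteration, which is why the refinement is guaranteed to converge to a (local) minimum.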

Effectiveness

“How effective is BIRCH?”

The time complexity of the algorithm is O(n), where n is the number of objects to be clustered. Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data. However, since each node in a CF-tree can hold only a limited number of entries due to its size, a CF-tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well because it uses the notion of radius or diameter to control the boundary of a cluster. It is local in that each clustering decision is made without scanning all data points and currently existing clusters. It exploits the observation that data space is not usually uniformly occupied and not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

Applications

In this section, we will see how BIRCH can be used to solve real-world problems and how it performs on real datasets.

Interactive and Iterative Pixel Classification

The first application is motivated by the MVI (Multiband Vegetation Imager) technique. The MVI is a combination of a charge-coupled device (CCD) camera, a filter exchange mechanism, and a computer, used to capture rapid, successive images of plant canopies in two wavelength bands. One image is taken in the visible wavelength band and the other in the near-infrared band. The purpose of using two wavelength bands is to allow for identification of different canopy components such as sunlit and shaded leaf area, sunlit and shaded branch area, clouds, and blue sky for studying plant canopy architecture. This is important to many fields including ecology, forestry, meteorology, and other agricultural sciences. The main use of BIRCH is to help classify pixels in the MVI images by performing clustering, and experimenting with different feature selection and weighting choices.

Codebook Generalization in Image Compression

Digital image compression is the technology of reducing image data to save storage space and transmission bandwidth. Vector quantization is a widely used image compression/decompression technique that operates on blocks of pixels instead of individual pixels for better efficiency. In vector quantization, the image is first decomposed into small rectangular blocks, and each block is represented as a vector. A codebook of size K contains K codewords, which are vectors serving as seeds that attract other vectors based upon the nearest-neighbor criterion. Each vector is encoded with the codebook, i.e., by finding its nearest codeword in the codebook, and is later decoded with the same codebook, i.e., by using its nearest codeword in the codebook as its value.

Given the training vectors (from the training image) and the desired codebook size (i.e., the number of codewords), the main problem of vector quantization is how to generate the codebook. BIRCH can be used to generate the codebook.

With BIRCH clustering the training vectors using the first three phases of the algorithm, i.e., with a single scan of the training vectors, the clusters obtained generally capture the major vector distribution patterns and have only minor inaccuracies. So in Phase 3, if we set the number of clusters directly to the desired codebook size and use the centroids of the obtained clusters as the initial codebook, we can feed them to GLA (the generalized Lloyd algorithm), an algorithm for finding the ‘optimal’ codebook of the current size, for further optimization. The initial codebook from the first three phases of BIRCH is not likely to lead to a bad locally optimal codebook, and using BIRCH to generate the codebook involves fewer scans of the training vectors.
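The nearest-codeword encode/decode step described above can be sketched as follows (a toy example with 2-D “blocks”; real blocks would be, say, 4x4 pixel vectors):

```python
def encode(blocks, codebook):
    """Map each block vector to the index of its nearest codeword."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [min(range(len(codebook)), key=lambda i: dist2(b, codebook[i]))
            for b in blocks]

def decode(indices, codebook):
    """Replace each index by its codeword: the lossy reconstruction."""
    return [codebook[i] for i in indices]

# Hypothetical 2-codeword codebook, e.g. cluster centroids from BIRCH + GLA.
codebook = [(0.0, 0.0), (10.0, 10.0)]
indices = encode([(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)], codebook)
```

Compression comes from transmitting only the indices (log2 K bits per block) instead of the full block vectors; decoding then substitutes the codewords back in.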

Summary

BIRCH provides a clustering method for very large datasets. It makes a large clustering problem tractable by concentrating on densely occupied portions of the data space and creating a compact summary. It utilizes measurements that capture the natural closeness of data and can be stored and updated incrementally in a height-balanced tree. BIRCH can work with any given amount of memory, and its I/O complexity is little more than one scan of the data. Experimentally, BIRCH has been shown to perform very well on several large datasets, and to be significantly superior to other clustering algorithms such as CLARANS and k-means in terms of quality, speed, stability, and scalability overall.