C++ program for performing “k-means clustering”.
This program was written to demonstrate the use of the linear classifiers in the “classify” program. The program can also be used for k-means clustering. It uses an embedded Canny edge detector to perform initial interest-point detection, and then uses the FAST library to compute the Euclidean distance transform of the interest points. This is used to generate the “potential” centers, while the “real” centers are estimated from the potential centers. Finally, the centers are adjusted and the results are evaluated.
The program output can be viewed or saved in vector form as a GraphML file.
README: Instructions on how to install and use the program.
ChangeLog: List of all changes between previous versions of the program.
Examples: Two examples of the program demonstrating different modes of use. The first example is similar to the one presented in the distribution of the ocr package. The second example performs k-means clustering.
[Added on 21-Nov-2006 by Christian Gottschlich]
In a very recent update, I added functionality to the program for computing superpixels. Superpixels are contiguous regions in an image that have similar pixel values. Traditionally, the superpixel algorithm is very time-consuming, so it is usually performed on graphics hardware (e.g. GPUs). However, it is sometimes difficult to obtain such hardware, and the superpixels may be too small to be useful.
The program can now compute superpixels, regardless of the number of clusters. There are several algorithms implemented, but the KMeans++ clustering algorithm (e.g. Danelli et al. 2004, Attali et al. 2005, and Srinivasan et al. 2006) seems to work best for my dataset.
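The seeding step that gives k-means++ its name can be sketched as follows. This is a minimal illustration of the standard seeding rule, not an excerpt from the program: the names Point, squaredDistance and seedCenters are hypothetical, and points are assumed to be plain vectors of doubles of equal dimension.

```cpp
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// k-means++ seeding: the first center is drawn uniformly at random; every
// further center is drawn with probability proportional to the squared
// distance from a point to its nearest already-chosen center.
// Assumes 1 <= k <= points.size().
std::vector<Point> seedCenters(const std::vector<Point>& points,
                               std::size_t k, std::mt19937& rng) {
    std::vector<Point> centers;
    std::uniform_int_distribution<std::size_t> uniform(0, points.size() - 1);
    centers.push_back(points[uniform(rng)]);

    // nearest[i] holds the squared distance from point i to its closest center.
    std::vector<double> nearest(points.size(),
                                std::numeric_limits<double>::max());
    while (centers.size() < k) {
        double total = 0.0;
        for (std::size_t i = 0; i < points.size(); ++i) {
            const double d = squaredDistance(points[i], centers.back());
            if (d < nearest[i]) nearest[i] = d;
            total += nearest[i];
        }
        if (total == 0.0) {
            // Every point coincides with a chosen center: fall back to uniform.
            centers.push_back(points[uniform(rng)]);
            continue;
        }
        std::discrete_distribution<std::size_t> pick(nearest.begin(),
                                                     nearest.end());
        centers.push_back(points[pick(rng)]);
    }
    return centers;
}
```

The returned centers would then serve as the initial guess for the iterative procedure. Points far from all existing centers are more likely to be picked, which tends to spread the initial centers over the data.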
[Added on 18-Dec-2007 by Christian Gottschlich]
[New 8-Jan-2008] I’ve updated the program to handle overlapping superpixels, for use on images with overlapping objects.
The program uses a technique called “watershed segmentation” to compute the “mapped” superpixels. Watershed segmentation is an image segmentation technique that splits the image into connected regions according to certain criteria. The optimization criteria here involve merging regions with similar intensity values and forming non-overlapping regions. Watershed segmentation is well-su
The kMeans routine implements k-means clustering; it is stable and gives reasonable results.
The procedure begins with an initial guess of centroids, and then repeatedly adjusts them based on the
centroids of their neighborhoods until either all data points are close enough to a center, or a maximum
number of iterations is reached.
The algorithm uses two steps:
Determining the clusters (l_Centers) for each datapoint
Updating the centers
The datapoint procedure assigns a centroid (l_Centers) to each data point. The idea is to find a partition of the data points into clusters: every data point (l_DataPts) in a cluster is assigned the centroid of that cluster. After a centroid is assigned to a data point, the distance from that data point to the centroid is stored in l_Dist. These distances are used to update the assignments and centroids in the next iteration.
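The two steps (assigning data points, then updating the centers) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the program's own code: the names l_DataPts, l_Centers, l_Clusters and l_Dist are taken from the text, but their types, the KMeansResult struct, the runKMeans function, the stopping rule (no assignment changed) and the handling of empty clusters are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Point = std::vector<double>;  // assumed point type

struct KMeansResult {                     // hypothetical result container
    std::vector<Point> l_Centers;         // one centroid per cluster
    std::vector<std::size_t> l_Clusters;  // cluster index of each data point
    std::vector<double> l_Dist;           // distance of each point to its centroid
    std::size_t iterations = 0;           // number of center updates performed
};

double euclidean(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Assumes at least one initial center and points of equal dimension.
KMeansResult runKMeans(const std::vector<Point>& l_DataPts,
                       std::vector<Point> initialCenters,
                       std::size_t maxIterations) {
    KMeansResult r;
    r.l_Centers = std::move(initialCenters);
    const std::size_t k = r.l_Centers.size();
    r.l_Clusters.assign(l_DataPts.size(), k);  // k marks "not yet assigned"
    r.l_Dist.assign(l_DataPts.size(), 0.0);

    while (true) {
        // Step 1: assign every data point to its nearest center.
        bool changed = false;
        for (std::size_t i = 0; i < l_DataPts.size(); ++i) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = euclidean(l_DataPts[i], r.l_Centers[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (best != r.l_Clusters[i]) changed = true;
            r.l_Clusters[i] = best;
            r.l_Dist[i] = bestDist;
        }
        if (!changed || r.iterations == maxIterations) break;

        // Step 2: move every center to the mean of its assigned points.
        for (std::size_t c = 0; c < k; ++c) {
            Point sum(r.l_Centers[c].size(), 0.0);
            std::size_t count = 0;
            for (std::size_t i = 0; i < l_DataPts.size(); ++i) {
                if (r.l_Clusters[i] != c) continue;
                for (std::size_t j = 0; j < sum.size(); ++j) sum[j] += l_DataPts[i][j];
                ++count;
            }
            if (count == 0) continue;  // assumption: an empty cluster keeps its center
            for (std::size_t j = 0; j < sum.size(); ++j) sum[j] /= static_cast<double>(count);
            r.l_Centers[c] = sum;
        }
        ++r.iterations;
    }
    return r;
}
```

The loop always ends with an assignment step, so l_Clusters and l_Dist are consistent with the centers that are returned.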
If the user passes in a flag, the centroids are computed in an initialization phase. Otherwise, they are computed in the main loop. The flag is used to optimize the centroids during the main loop as opposed to the initialization phase.
l_Count is computed only once; only l_DataPts and l_DataVect are recomputed in each iteration. The centroids are updated in every iteration. As a result, the complexity of the program is O(N log N).
The distance from a point to a centroid can be thought of as a measure of how well the point fits into that cluster. Clusters of higher quality (a better fit) are those whose points lie closer to their centroid. The following variables describe the centroids and their quality.
l_Clusters is a vector containing a cluster number for each data point, initialized with the size of the vector.
l_Centers is a vector containing a cluster ID and centroid vector for each datapoint, initialized with the size of the vector. The indices of the centroids correspond to the indices of the vector.
l_Dist is a matrix containing the distance values from all data points to each centroid, initialized with the size of the matrix. The i-th row of l_Dist corresponds to the distance of the
A set of basic facilities for performing k-means clustering.
KMeans.cpp implements the C++ program for doing k-means clustering based on a combination of local search and Lloyd’s algorithm.
How to run the program:
To run the program under Cygwin, you must first configure the makefile.
The top-level makefile contains a `make clean` target, to which a `make` target should be added.
This will ensure that all necessary include files (including GSL and STL) are compiled before the actual makefile starts.
After that, you can run `make` in the KMeans directory. In the end, the program generates a shell script which you can run in any Unix environment (e.g., in Cygwin) to perform clustering.
You can also run ./kmeans to generate a standalone executable for performing clustering.
This software uses the GSL library, which is included automatically by the script. It also includes the STL, so you only have to include that in your own programs.
Applications using k-means clustering:
This program can be used for training the k-means clustering algorithm in simpler applications.
It can also be used for inferring clusters that are not specified in advance, which can be handy when the data does not always fit a known cluster model.
Due to its complexity, Lloyd’s algorithm is not used in this version of KMeans. However, it can be easily incorporated into future releases.
KMeans can be extended to work with arbitrary data and Euclidean distances, not just the dot product.
Under a license that requires a copy of the GPL notice, KMeans is included in GNU Octave as kmeans.m.
Copyright (C) 2004 by
2004 University of Copenhagen
Silvio Risi and Ken Ribet
University of Hertfordshire
The k-means model is one of the most common models for data clustering. Given a set of N data points, each represented by a vector in the n-dimensional space R^n, the k-means algorithm partitions these points into k sets called clusters.
The objective of clustering is to partition a given set of data points (represented by the vectors) into k mutually exclusive clusters, or neighborhoods.
Given a set of k centers (barycenters), the objective of clustering is to place each data point in the right neighborhood (cluster).
KMeans divides the vectors into two sets (called k-means sets):
E: A set of vectors that are in the same cluster (assigned to the same center).
I: A set of vectors that are not in the same cluster (assigned to a different center).
It is also necessary to determine whether vectors that are in the same cluster should be moved closer to, or further away from, each other. This is accomplished through a so-called dissimilarity function.
In the n-dimensional vector space, the distance between two vectors can be measured with any of the common measures, among them the Euclidean and Manhattan distances.
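The two measures named above can be written as interchangeable functions, so that the clustering code does not depend on a particular one. The sketch below is only an illustration: the names Point, Distance, euclideanDistance, manhattanDistance and nearestCenter are hypothetical and do not come from the program.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type
using Distance = std::function<double(const Point&, const Point&)>;

// Euclidean distance: square root of the sum of squared coordinate differences.
double euclideanDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Manhattan distance: sum of absolute coordinate differences.
double manhattanDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += std::fabs(a[i] - b[i]);
    return sum;
}

// Index of the center nearest to p under the chosen measure.
// Assumes centers is non-empty; ties go to the lowest index.
std::size_t nearestCenter(const Point& p, const std::vector<Point>& centers,
                          const Distance& dist) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centers.size(); ++c) {
        const double d = dist(p, centers[c]);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}
```

The choice of measure can change the result. For the point (0,0) and the centers (3,3) and (0,5), the Euclidean distance selects the first center (about 4.24 against 5), while the Manhattan distance selects the second (6 against 5).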
The following picture illustrates one possible solution. In this image each cluster is represented by a colored area. Vectors in the same cluster share the same color, while vectors in different clusters have different colors.
In the picture the following example is used:
The dataset is a set of vectors (the 3 red vectors) generated by applying the random function rand(3) to each of the 3 vectors.
The 3 clusters are created by applying 3 different random centers.
The vector (0,0,0) (shown in blue in the picture) is assigned to the center of the cluster.
The cluster with center (0.5,0,0) contains vectors with values (0.5,0,0), (0.5,0.5,0) and (0,0.5,0). The cluster with center (0.5,1,0) contains vectors with values