Algorithms Reference: Clustering

K-Means Clustering

Description
Given a collection of $n$ records with a pairwise similarity measure, the goal of clustering is to assign a category label to each record so that similar records tend to get the same label. In contrast to multinomial logistic regression, clustering is an unsupervised learning problem with neither category assignments nor label interpretations given in advance. In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical feature vectors of $\dim x_i = m$ with the squared Euclidean distance $\|x_i - x_{i'}\|_2^2$ as the distance measure. We want to partition $\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that the aggregated squared distance from records to their cluster means is minimized:
\[\begin{equation} \textrm{WCSS} \,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{mean}}(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min \end{equation}\]The aggregated distance measure in (1) is called the within-cluster sum of squares (WCSS). It can be viewed as a measure of residual variance that remains in the data after the clustering assignment, conceptually similar to the residual sum of squares (RSS) in linear regression. However, unlike for the RSS, the minimization of (1) is an NP-hard problem [AloiseDHP2009].
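To make the objective concrete, here is a minimal Python sketch (illustrative only, not part of the scripts; the function name `wcss` is ours) that computes the within-cluster sum of squares for a given cluster assignment:

```python
from math import dist  # Python 3.8+: Euclidean distance between two points

def wcss(records, assignment, k):
    """Within-cluster sum of squares for a given partition.

    records:    list of m-dimensional points (tuples or lists)
    assignment: assignment[i] is the cluster label (0..k-1) of records[i]
    Assumes every cluster is nonempty.
    """
    m = len(records[0])
    # Accumulate per-cluster coordinate sums and record counts
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for x, j in zip(records, assignment):
        counts[j] += 1
        for d in range(m):
            sums[j][d] += x[d]
    # Each cluster's mean vector
    means = [[s / c for s in sums[j]] for j, c in enumerate(counts)]
    # Sum of squared distances from records to their cluster means
    return sum(dist(x, means[j]) ** 2 for x, j in zip(records, assignment))
```

For example, four points forming two well-separated pairs, each at distance 1 from its pair's mean, give WCSS = 4.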
Rather than searching for the global optimum in (1), a heuristic algorithm called Lloyd's algorithm is typically used. This iterative algorithm maintains and updates a set of $k$ centroids $\{c_1, \ldots, c_k\}$, one centroid per cluster. It defines each cluster $S_j$ as the set of all records closer to $c_j$ than to any other centroid. Each iteration of the algorithm reduces the WCSS in two steps:
1. Assign each record to the closest centroid, after which generally $\mathop{\textrm{mean}}(S_j)\neq c_j$;
2. Reset each centroid to its cluster's mean: $c_j := \mathop{\textrm{mean}}(S_j)$.
After Step 1, the centroids are generally different from the cluster means, so we can compute another "within-cluster sum of squares" based on the centroids:
\[\textrm{WCSS\_C} \,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2 \label{eqn:WCSS:C}\]This WCSS_C after Step 1 is less than the means-based WCSS before Step 1 (or equal, if convergence has been achieved), and after Step 2 the new WCSS cannot exceed the WCSS_C for the same clustering; hence the WCSS reduction.
Exact convergence is reached when each record is closer to its cluster's mean than to any other cluster's mean, so there are no more reassignments and the centroids coincide with the means. In practice, iterations may be stopped when the reduction in WCSS (or in WCSS_C) falls below a minimum threshold, or upon reaching the maximum number of iterations. The initialization of the centroids is also an important part of the algorithm. The smallest WCSS obtained by the algorithm is generally not the global minimum and varies depending on the initial centroids. We implement multiple parallel runs with different initial centroids and report the best result.
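The two-step iteration and the relative-reduction stopping rule can be sketched in a few lines of Python (a self-contained illustration, not the script's implementation; the function name `lloyd`, the naive uniform initialization, and the parameter defaults are our assumptions):

```python
import random
from math import dist

def lloyd(records, k, max_iter=100, eps=1e-9, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)  # naive init; the script uses k-means++
    old_wcss = float("inf")
    for _ in range(max_iter):
        # Step 1: assign each record to its closest centroid
        clusters = [[] for _ in range(k)]
        wcss_c = 0.0
        for x in records:
            j = min(range(k), key=lambda j: dist(x, centroids[j]))
            clusters[j].append(x)
            wcss_c += dist(x, centroids[j]) ** 2
        # Step 2: reset each centroid to its cluster's mean
        # (an empty cluster keeps its old centroid in this sketch)
        centroids = [
            tuple(sum(col) / len(S) for col in zip(*S)) if S else c
            for S, c in zip(clusters, centroids)
        ]
        # Stop when the relative WCSS reduction falls below eps
        if old_wcss - wcss_c < eps * wcss_c:
            break
        old_wcss = wcss_c
    return centroids
```

On well-separated data the returned centroids coincide with the cluster means; on harder data the result is a local optimum that depends on the initialization, which is why multiple runs are used.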
Scoring. Our scoring script evaluates the clustering output by comparing it with a known category assignment. Since cluster labels have no prior correspondence to the categories, we cannot count "correct" and "wrong" cluster assignments. Instead, we quantify the agreement in two ways:
1. Count how many same-category and different-category pairs of records end up in the same cluster or in different clusters;
2. For each category, count the prevalence of its most common cluster; for each cluster, count the prevalence of its most common category.
The number of categories and the number of clusters ($k$) do not have to be equal. A same-category pair of records clustered into the same cluster is viewed as a "true positive," a different-category pair clustered together is a "false positive," a same-category pair clustered apart is a "false negative," etc.
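The pair-counting idea can be illustrated with a short Python sketch (hypothetical helper, not the scoring script itself; the name `pair_counts` and return order are ours):

```python
from itertools import combinations

def pair_counts(categories, clusters):
    """Classify every record pair by same/different category vs. cluster.

    Returns (true_same, true_diff, false_same, false_diff):
    e.g. a same-category pair placed in the same cluster is a "true positive".
    """
    true_same = true_diff = false_same = false_diff = 0
    for i, j in combinations(range(len(categories)), 2):
        same_cat = categories[i] == categories[j]
        same_clu = clusters[i] == clusters[j]
        if same_cat and same_clu:
            true_same += 1        # TRUE_SAME: "true positive" pair
        elif not same_cat and not same_clu:
            true_diff += 1        # TRUE_DIFF: "true negative" pair
        elif not same_cat and same_clu:
            false_same += 1       # FALSE_SAME: "false positive" pair
        else:
            false_diff += 1       # FALSE_DIFF: "false negative" pair
    return true_same, true_diff, false_same, false_diff
```

A perfect two-category, two-cluster assignment of four records yields 2 true-same and 4 true-diff pairs and no false pairs.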
**Table 6**: The `O`-file for `Kmeans-predict` provides the output statistics in CSV format, one per line, in the following format: (NAME, [CID], VALUE). Note: the 1st group of statistics is given if the `X` input is available; the 2nd group is given if both `X` and `C` inputs are available; the 3rd and 4th groups are given if the `spY` input is available; only the 4th group of statistics contains a nonempty CID value; when present, CID contains either the specified category label or the predicted cluster label.
| Inputs Available | Name | CID | Meaning |
| :--------------: | :--- | :-: | :------ |
| X | TSS | | Total Sum of Squares (from the total mean) |
| X | WCSS_M | | Within-Cluster Sum of Squares (means as centers) |
| X | WCSS_M_PC | | Within-Cluster Sum of Squares (means), in % of TSS |
| X | BCSS_M | | Between-Cluster Sum of Squares (means as centers) |
| X | BCSS_M_PC | | Between-Cluster Sum of Squares (means), in % of TSS |
| X and C | WCSS_C | | Within-Cluster Sum of Squares (centroids as centers) |
| X and C | WCSS_C_PC | | Within-Cluster Sum of Squares (centroids), in % of TSS |
| X and C | BCSS_C | | Between-Cluster Sum of Squares (centroids as centers) |
| X and C | BCSS_C_PC | | Between-Cluster Sum of Squares (centroids), in % of TSS |
| spY | TRUE_SAME_CT | | Same-category pairs predicted as same-cluster, count |
| spY | TRUE_SAME_PC | | Same-category pairs predicted as same-cluster, % |
| spY | TRUE_DIFF_CT | | Diff-category pairs predicted as diff-cluster, count |
| spY | TRUE_DIFF_PC | | Diff-category pairs predicted as diff-cluster, % |
| spY | FALSE_SAME_CT | | Diff-category pairs predicted as same-cluster, count |
| spY | FALSE_SAME_PC | | Diff-category pairs predicted as same-cluster, % |
| spY | FALSE_DIFF_CT | | Same-category pairs predicted as diff-cluster, count |
| spY | FALSE_DIFF_PC | | Same-category pairs predicted as diff-cluster, % |
| spY | SPEC_TO_PRED | + | For specified category, the best predicted cluster id |
| spY | SPEC_FULL_CT | + | For specified category, its full count |
| spY | SPEC_MATCH_CT | + | For specified category, best-cluster matching count |
| spY | SPEC_MATCH_PC | + | For specified category, % of matching to full count |
| spY | PRED_TO_SPEC | + | For predicted cluster, the best specified category id |
| spY | PRED_FULL_CT | + | For predicted cluster, its full count |
| spY | PRED_MATCH_CT | + | For predicted cluster, best-category matching count |
| spY | PRED_MATCH_PC | + | For predicted cluster, % of matching to full count |
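Assuming the CSV layout described above (NAME, optional CID, VALUE), the `O`-file can be loaded into a lookup dictionary with a few lines of Python; the helper name `read_kmeans_stats` and the type-coercion rule (counts, percentages, and CID-less statistics become floats, labels stay strings) are our assumptions, not part of the scripts:

```python
import csv
import io

def read_kmeans_stats(text):
    """Parse O-file lines "NAME,[CID],VALUE" into a dict keyed by (NAME, CID).

    CID is the empty string for the first three statistic groups.
    """
    stats = {}
    for name, cid, value in csv.reader(io.StringIO(text)):
        if cid == "" or name.endswith(("_CT", "_PC")):
            stats[(name, cid)] = float(value)   # numeric statistic
        else:
            stats[(name, cid)] = value          # category/cluster label
    return stats
```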
Details
Our clustering script proceeds in three stages: centroid initialization, parallel $k$-means iterations, and best-available output generation. Centroids are initialized at random from the input records (the rows of $X$), biased towards being chosen far apart from each other. The initialization method is based on the k-means++ heuristic from [ArthurVassilvitskii2007], with one important difference: to reduce the number of passes through $X$, we take a small sample of $X$ and run the k-means++ heuristic over this sample. Here is, conceptually, our centroid initialization algorithm for one clustering run:
1. Sample the rows of $X$ uniformly at random, picking each row with probability $p = ks / n$, where
   - $k$ is the number of centroids,
   - $n$ is the number of records, and
   - $s$ is the `samp` input parameter.

   If $ks \geq n$, the entire $X$ is used in place of its sample.
2. Choose the first centroid uniformly at random from the sampled rows.
3. Choose each subsequent centroid from the sampled rows, at random, with probability proportional to the squared Euclidean distance between the row and the nearest already-chosen centroid.
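The k-means++ seeding over an already-drawn sample can be sketched in Python (an illustrative helper under our own naming, not the script's code):

```python
import random

def kmeanspp_init(sample, k, rng=random.Random(0)):
    """k-means++ seeding: the first centroid is uniform over the sample;
    each next one is drawn with probability proportional to its squared
    distance to the nearest already-chosen centroid."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    centroids = [rng.choice(sample)]
    # d2[i] = squared distance from sample[i] to its nearest chosen centroid
    d2 = [sqdist(x, centroids[0]) for x in sample]
    while len(centroids) < k:
        # Weighted draw proportional to d2
        r = rng.random() * sum(d2)
        acc = 0.0
        chosen = sample[-1]  # fallback against floating-point edge cases
        for x, w in zip(sample, d2):
            acc += w
            if acc >= r:
                chosen = x
                break
        centroids.append(chosen)
        # Fold the new centroid into the nearest-centroid distances
        d2 = [min(d, sqdist(x, chosen)) for d, x in zip(d2, sample)]
    return centroids
```

The squared-distance weighting makes far-apart rows much more likely to be chosen, which is exactly the "biased towards being chosen far apart" behavior described above.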
The sampling of $X$ and the selection of centroids are performed independently and in parallel for each run of the $k$-means algorithm. When we sample the rows of $X$, rather than tossing a random coin for each row, we compute the number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$, where $u\in (0, 1)$ is uniformly random. This time-saving trick works because
\[Prob[k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\, Prob[\textrm{skip $k-1$ rows}]\]However, it requires us to estimate the maximum sample size in advance, which we set near $ks + 10\sqrt{ks}$ to make it generous enough.
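The geometric-skip trick above can be sketched as follows (a hypothetical Python helper; the redraw of $u = 0$ and the function name are our additions):

```python
import math
import random

def sample_rows(n, p, rng=random.Random(42)):
    """Bernoulli(p) row sampling via geometric skips instead of n coin tosses.

    Advancing by ceil(log(u) / log(1 - p)) rows skips k-1 rows with
    probability p(1-p)^(k-1), reproducing independent inclusion with
    probability p, as in the identity in the text.
    """
    picked = []
    i = -1
    while True:
        u = rng.random()
        while u == 0.0:      # log(0) is undefined; redraw
            u = rng.random()
        skip = math.ceil(math.log(u) / math.log(1.0 - p))  # skip >= 1
        i += skip
        if i >= n:
            return picked
        picked.append(i)
```

The number of random draws is proportional to the expected sample size $np$, not to $n$, which is the time saving.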
Once we have selected the initial centroid sets, we start the $k$-means iterations independently in parallel for all clustering runs. The number of clustering runs is given by the `runs` input parameter. Each iteration of each clustering run performs the following steps:
1. Compute the centroid-dependent part of the squared Euclidean distances from all records (rows of $X$) to each of the $k$ centroids using a matrix product;
2. Take the minimum of the above for each record;
3. Update the current within-cluster sum of squares (WCSS) value, with centroids substituted for the means for efficiency;
4. Check the convergence criterion
\[\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < {\varepsilon}\cdot\textrm{WCSS}_{\mathrm{new}}\]
as well as the limit on the number of iterations;
5. Find the closest centroid for each record, sharing equally any records with multiple closest centroids;
6. Compute the number of records closest to each centroid, checking for "runaway" centroids with no records left (in which case the run fails);
7. Compute the new centroids by averaging the records in their clusters.
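The "centroid-dependent part" in Steps 1 and 2 relies on the expansion $\|x - c\|^2 = \|x\|^2 - 2\,x\cdot c + \|c\|^2$: the $\|x\|^2$ term is the same for every centroid, so it can be dropped when minimizing over centroids. Here is an illustrative pure-Python sketch of that shortcut (the DML script computes the $-2\,x\cdot c$ terms with a single matrix product; the explicit loops and the name `closest_centroids` are ours):

```python
def closest_centroids(X, C):
    """Assign each row of X to the index of its closest centroid in C,
    comparing only the centroid-dependent part -2 x.c + ||c||^2
    of the squared Euclidean distance."""
    c_norm2 = [sum(v * v for v in c) for c in C]   # ||c||^2 per centroid
    assign = []
    for x in X:
        # Centroid-dependent part of ||x - c||^2 for each centroid
        scores = [c_norm2[j] - 2.0 * sum(u * v for u, v in zip(x, C[j]))
                  for j in range(len(C))]
        assign.append(min(range(len(C)), key=scores.__getitem__))
    return assign
```

Because the scores differ from the true squared distances only by the record-dependent constant $\|x\|^2$, the argmin is identical to a full distance computation.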
When a termination condition is satisfied, we store the centroids and the WCSS value and exit this run. A run has to satisfy the WCSS convergence criterion to be considered successful. Upon the termination of all runs, we select the smallest WCSS value among the successful runs, and write out this run's centroids. If requested, we also compute the cluster assignment of all records in $X$, using integers from 1 to $k$ as the cluster labels. The scoring script can then be used to compare the cluster assignment with an externally specified category assignment.
Returns
We output the $k$ centroids for the best available clustering, i.e. the one whose WCSS is the smallest among all successful runs. The centroids are written as the rows of the $k\,{\times}\,m$-matrix into the output file whose path/name was provided as the `C` input argument. If the input parameter `isY` was set to `1`, we also output the one-column matrix with the cluster assignment for all the records. This assignment is written into the file whose path/name was provided as the `Y` input argument. The best WCSS value, as well as some information about the performance of the other runs, is printed during the script execution. The scoring script `Kmeans-predict.dml` prints all its results in a self-explanatory manner, as defined in Table 6.