13: Clustering
Previous Next Index
Unsupervised learning  introduction
 Talk about clustering
 Learning from unlabeled data
 Unsupervised learning
 Useful to contras with supervised learning
 Compare and contrast
 Supervised learning
 Given a set of labels, fit a hypothesis to it
 Unsupervised learning
 Try and determining structure in the data
 Clustering algorithm groups data together based on data features
 What is clustering good for
 Market segmentation  group customers into different market segments
 Social network analysis  Facebook "smartlists"
 Organizing computer clusters and data centers for network layout and location
 Astronomical data analysis  Understanding galaxy formation
Kmeans algorithm
 Want an algorithm to automatically group the data into coherent clusters
 Kmeans is by far the most widely used clustering algorithm
Overview
 Take unlabeled data and group into two clusters
 Algorithm overview
 1) Randomly allocate two points as the cluster centroids
 Have as many cluster centroids as clusters you want to do (K cluster centroids, in fact)
 In our example we just have two clusters
 2) Cluster assignment step
 Go through each example and depending on if it's closer to the red or blue centroid assign each point to one of the two clusters
 To demonstrate this, we've gone through the data and "colour" each point red or blue
 3) Move centroid step
 Take each centroid and move to the average of the correspondingly assigned datapoints
 Repeat 2) and 3) until convergence
 More formal definition
 Input:
 K (number of clusters in the data)
 Training set {x^{1}, x^{2}, x^{3} ..., x^{n})
 Algorithm:
 Randomly initialize K cluster centroids as {μ_{1}, μ_{2}, μ_{3} ... μ_{K}}
 Loop 1
 This inner loop repeatedly sets the c^{(i) }variable to be the index of the closes variable of cluster centroid closes to x^{i }
 i.e. take i^{th} example, measure squared distance to each cluster centroid, assign c^{(i)}to the cluster closest
 Loop 2
 Loops over each centroid calculate the average mean based on all the points associated with each centroid from c^{(i)}
 What if there's a centroid with no data
 Remove that centroid, so end up with K1 classes
 Or, randomly reinitialize it
Kmeans for nonseparated clusters
 So far looking at Kmeans where we have well defined clusters
 But often Kmeans is applied to datasets where there aren't well defined clusters
 Not obvious discrete groups
 Say you want to have three sizes (S,M,L) how big do you make these?
 One way would be to run Kmeans on this data
 May do the following
 So creates three clusters, even though they aren't really there
 Look at first population of people
 Try and design a small Tshirt which fits the 1st population
 And so on for the other two
 This is an example of market segmentation
 Build products which suit the needs of your subpopulations
K means optimization objective
 Supervised learning algorithms have an optimization objective (cost function)
 Kmeans has an optimization objective like the supervised learning functions we've seen
 Why is this good?
 Knowing this is useful because it helps for debugging
 Helps find better clusters
 While Kmeans is running we keep track of two sets of variables
 c^{i} is the index of clusters {1,2, ..., K} to which x^{i} is currently assigned
 i.e. there are m c^{i} values, as each example has a c^{i} value, and that value is one the the clusters (i.e. can only be one of K different values)
 μ_{k}, is the cluster associated with centroid k
 Locations of cluster centroid k
 So there are K
 So these the centroids which exist in the training data space
 μ_{c}^{i}, is the cluster centroid of the cluster to which example x^{i} has been assigned to
 This is more for convenience than anything else
 You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
 Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μ_{j})
 But instead, for easy description, we have this variable which gets exactly the same value
 Lets say x^{i }as been assigned to cluster 5
 Means that
 c^{i} = 5
 μ_{c}^{i}, = μ_{5}
 Using this notation we can write the optimization objective;
 i.e. squared distances between training example x^{i }and the cluster centroid to which x^{i }has been assigned to
 This is just what we've been doing, as the visual description below shows;
 The red line here shows the distances between the example x^{i }and the cluster to which that example has been assigned
 Means that when the example is very close to the cluster, this value is small
 When the cluster is very far away from the example, the value is large
 This is sometimes called the distortion (or distortion cost function)
 So we are finding the values which minimizes this function;
 If we consider the kmeans algorithm
 The cluster assigned step is minimizing J(...) with respect to c^{1}, c^{2} ... c^{i}
 i.e. find the centroid closest to each example
 Doesn't change the centroids themselves
 The move centroid step
 We can show this step is choosing the values of μ which minimizes J(...) with respect to μ
 So, we're partitioning the algorithm into two parts
 First part minimizes the c variables
 Second part minimizes the J variables
 We can use this knowledge to help debug our Kmeans algorithm
Random initialization
 How we initialize Kmeans
 And how avoid local optimum
 Consider clustering algorithm
 Never spoke about how we initialize the centroids
 A few ways  one method is most recommended
 Have number of centroids set to less than number of examples (K < m) (if K > m we have a problem)0
 Randomly pick K training examples
 Set μ_{1} up to μ_{K} to these example's values
 K means can converge to different solutions depending on the initialization setup
 Risk of local optimum
 The local optimum are valid convergence, but local optimum not global ones
 If this is a concern
 We can do multiple random initializations
 See if we get the same result  many same results are likely to indicate a global optimum
 Algorithmically we can do this as follows;
 A typical number of times to initialize Kmeans is 501000
 Randomly initialize Kmeans
 For each 100 random initialization run Kmeans
 Then compute the distortion on the set of cluster assignments and centroids at convergent
 End with 100 ways of cluster the data
 Pick the clustering which gave the lowest distortion
 If you're running K means with 210 clusters can help find better global optimum
 If K is larger than 10, then multiple random initializations are less likely to be necessary
 First solution is probably good enough (better granularity of clustering)
How do we choose the number of clusters?
 Choosing K?
 Not a great way to do this automatically
 Normally use visualizations to do it manually
 What are the intuitions regarding the data?
 Why is this hard
 Sometimes very ambiguous
 e.g. two clusters or four clusters
 Not necessarily a correct answer
 This is why doing it automatic this is hard
Elbow method
 Vary K and compute cost function at a range of K values
 As K increases J(...) minimum value should decrease (i.e. you decrease the granularity so centroids can better optimize)
 Look for the "elbow" on the graph
 Chose the "elbow" number of clusters
 If you get a nice plot this is a reasonable way of choosing K
 Risks
 Normally you don't get a a nice line > no clear elbow on curve
 Not really that helpful
Another method for choosing K
 Using Kmeans for market segmentation
 Running Kmeans for a later/downstream purpose
 See how well different number of clusters serve you later needs
 e.g.
 Tshirt size example
 If you have three sizes (S,M,L)
 Or five sizes (XS, S, M, L, XL)
 Run K means where K = 3 and K = 5
 How does this look
 This gives a way to chose the number of clusters
 Could consider the cost of making extra sizes vs. how well distributed the products are
 How important are those sizes though? (e.g. more sizes might make the customers happier)
 So applied problem may help guide the number of clusters

