ISYE 6740 - Homework 1

Question 1: Image compression using clustering

Note: The individual steps of the k-means algorithm are shown below before being put together in the "Full Process - Question 1.1" section.

Importing libraries and football.bmp image

Reshaping the array from (412, 620, 3) to (255440, 3) so that each row represents one RGB pixel

Referenced: https://stackoverflow.com/questions/10439104/reading-bmp-files-in-python
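A minimal sketch of this loading/reshaping step (assuming Pillow and NumPy; the exact cell is not reproduced here):

```python
import numpy as np
from PIL import Image

img = Image.open("football.bmp")        # read the bitmap
img_arr = np.asarray(img)               # shape (412, 620, 3)

# Flatten the spatial dimensions so each row is one RGB pixel
pixels = img_arr.reshape(-1, 3).astype(float)   # shape (255440, 3)
```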

Randomly initialize cluster centers
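A sketch of one way this initialization could look (the variable names and the uniform draw over [0, 255] are my assumptions):

```python
# Draw k random RGB triples uniformly from [0, 255] as the initial centroids
k = 4
rng = np.random.default_rng()
centroids = rng.uniform(0, 255, size=(k, 3))
```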

REPEAT: a) Assign pixels to nearest cluster and b) adjust cluster centers
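One assignment/update iteration might look roughly like the sketch below (illustrative only; the actual loop and convergence test live in k_means_compress):

```python
# a) Assign each pixel to its nearest centroid under Euclidean (L2) distance
dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# b) Move each centroid to the mean of the pixels assigned to it
centroids = np.array([
    pixels[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
    for j in range(k)
])
```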

Display compressed image
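Sketch of the display step: every pixel is replaced by its centroid's color and the array is restored to the original image shape.

```python
# Rebuild the image from the k centroid colors and show it
compressed = centroids[labels].reshape(412, 620, 3).astype(np.uint8)
Image.fromarray(compressed)   # renders inline in a notebook
```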

Full Process - Question 1.1

The k_means_compress function below takes 2 inputs (image pixel values and k, the number of desired clusters) and returns 2 outputs (each pixel's assigned cluster label and the values of the clustering centroids).

The k_means_compress function has 3 main steps:

1) Randomly initialize cluster centers
2) Repeatedly a) assign each pixel to its nearest cluster and b) move each cluster center to the average value of the pixels assigned to it, until the centroid locations stop changing significantly
3) Display the compressed image, using only as many distinct pixel values as there are cluster centers

Note: I used the image file as the input, rather than the pixels array, in order to retain the dimensions of the original image for outputting the compressed image. This can easily be switched back to inputs = (pixels, k) by performing the pixels calculation outside of the k_means_compress function, but I thought it was helpful to display the compressed image.
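A minimal sketch of a function with this interface (file path and k in; labels and centroids out, plus a displayed reconstruction). The tolerance, iteration cap, and helper names are my assumptions, not the exact implementation:

```python
import numpy as np
from PIL import Image

def k_means_compress(image_path, k, tol=1.0, max_iter=100):
    """Cluster the image's pixels into k colors; return (labels, centroids)."""
    img = np.asarray(Image.open(image_path)).astype(float)
    h, w, c = img.shape
    pixels = img.reshape(-1, c)

    # 1) Randomly initialize cluster centers in RGB space
    rng = np.random.default_rng()
    centroids = rng.uniform(0, 255, size=(k, c))

    for _ in range(max_iter):
        # 2a) Assign each pixel to its nearest centroid (L2 distance)
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 2b) Move each centroid to the mean of its assigned pixels
        new_centroids = np.array([
            pixels[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids have essentially stopped moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    # 3) Rebuild and display the compressed image from the k centroid colors
    compressed = centroids[labels].reshape(h, w, c).astype(np.uint8)
    Image.fromarray(compressed).show()   # or display(...) inside a notebook
    return labels, centroids
```

A call would then look like labels, centroids = k_means_compress("football.bmp", k=4).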

football.bmp

Below are the runtimes and iterations for k-means clustering convergence for the football.bmp image compression:

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

GeorgiaTech.bmp

Below are the runtimes and iterations for k-means clustering convergence for the GeorgiaTech.bmp image compression:

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

XING_B24.bmp (my sample image)

Below are the runtimes and iterations for k-means clustering convergence for the XING_B24.bmp image compression:

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

Question 1.2

The k_means_compress_poor function below is an exact copy of the k_means_compress function above, except for its poor centroid initialization strategy.

The poor initialization strategy I am using caps each centroid's initial value at 100 in every channel (instead of the usual 255). This crowds the initial cluster centers into one corner of the color space, making the algorithm more likely to converge to a poor local optimum when minimizing the L2-norm objective function.
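A sketch of the only change relative to the random initialization above (variable names assumed):

```python
# "Poor" initialization: centroids drawn from [0, 100] in each RGB channel,
# so they all start crowded in the darker corner of color space
rng = np.random.default_rng()
poor_centroids = rng.uniform(0, 100, size=(k, 3))
```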

Comparing random cluster initialization to poor cluster initialization using the sample image

After multiple runs using the sample image and k=4, it is clear that random cluster initialization is superior to the poor cluster initialization strategy.

Metrics from each initialization strategy, along with the accompanying compressed images, are below:

Although both methods needed the same number of iterations, the compression using random initialization consistently required more runtime. This is likely because the poor initialization strategy converges to a nearby local optimum before the random initialization strategy can reach more useful, more widely separated optima. As seen in the images below, the image produced via random initialization has more distinct color groupings, indicating a more effective spread of initial cluster centers.

Question 1.3

The k_means_compress_manhattan function below is an exact copy of the k_means_compress function above, except that it uses Manhattan (L1) distance as the distance metric and updates each centroid to the median of its cluster's pixel values instead of the mean.
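A sketch of how one assignment/update iteration changes under the L1 variant (continuing the notation from the sketches above):

```python
# a) Assign each pixel to its nearest centroid under Manhattan (L1) distance
dists = np.abs(pixels[:, None, :] - centroids[None, :, :]).sum(axis=2)
labels = dists.argmin(axis=1)

# b) Move each centroid to the per-channel median of its assigned pixels
centroids = np.array([
    np.median(pixels[labels == j], axis=0) if np.any(labels == j) else centroids[j]
    for j in range(k)
])
```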

Comments: Both the L2-norm and L1-norm distance methods produced compressed images of similar appearance. However, the L1-norm method generally required more iterations and a longer runtime to converge, and this effect became more pronounced as the number of clusters increased.

This increase in runtime and iterations when moving from Euclidean to Manhattan distance is a result of using the same convergence condition for both methods. Realistically, the Manhattan-distance clustering could use a looser convergence threshold (and therefore less runtime) while remaining about as effective as the Euclidean-distance method.

football.bmp

Below are the runtimes and iterations for k-means clustering convergence for the football.bmp image compression (using Manhattan distance):

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

GeorgiaTech.bmp

Below are the runtimes and iterations for k-means clustering convergence for the GeorgiaTech.bmp image compression (using Manhattan distance):

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

XING_B24.bmp (my sample image)

Below are the runtimes and iterations for k-means clustering convergence for the XING_B24.bmp image compression (using Manhattan distance):

Compressed images are displayed in order below, along with 1) each pixel's class and 2) centroid locations.

Question 2: Spectral clustering and discovering football colleges

Question 2.1

Create adjacency matrix using the simple graph provided:

Generate the degree matrix from the adjacency matrix (each diagonal entry is the number of nodes adjacent to the corresponding node, i.e. the row sum of the adjacency matrix, for each of the 5 nodes in the graph)
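A sketch of these two matrices. The exact edge list from the homework figure is not reproduced in this write-up, so the adjacency below assumes nodes 1-3 form one connected component (a triangle) and nodes 4-5 form the other (a single edge):

```python
import numpy as np

# Assumed adjacency matrix: triangle on nodes 1-3, edge between nodes 4-5
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])

# Degree matrix: each diagonal entry is that node's number of neighbors
D = np.diag(A.sum(axis=1))
```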

Eigenvectors associated with zero eigenvalue:

Explanation:

The multiplicity of the eigenvalue 0 is 2, corresponding to the 2 connected components in the graph.

These 2 eigenvectors encode the cluster assignment information. The first eigenvector has nonzero values in positions 0, 1, and 2, corresponding to the connected component formed by nodes 1, 2, and 3 in the graph. The second eigenvector has nonzero values in positions 3 and 4, corresponding to the component formed by nodes 4 and 5.
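Continuing the assumed A and D above, a sketch of the Laplacian eigendecomposition that produces these eigenvectors:

```python
# Unnormalized graph Laplacian and its eigendecomposition
L = D - A
eigvals, eigvecs = np.linalg.eigh(L)

# Eigenvalue 0 appears twice; its eigenvectors are (up to rotation within the
# null space) indicator-like vectors, each supported on one connected component
zero_idx = np.where(np.isclose(eigvals, 0))[0]
print(len(zero_idx))           # 2 connected components
print(eigvecs[:, zero_idx])    # support on {nodes 1-3} and on {nodes 4-5}
```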

Question 2.2

The eigendecomposition of the normalized Laplacian matrix is above. Below are the plotted eigenvalues (ranked largest to smallest).

Based on the plot, I would say there should be approximately 17 clusters. I am choosing about 17 because, when considering this eigendecomposition, one wants to keep the k largest eigenvalues; after the first 17 or so sorted eigenvalues there is a large drop in value (from ~0.6 to ~0.4), which I treat as the cutoff for the "large" eigenvalues.
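A sketch of this eigenvalue plot. Here A_fb stands for the adjacency matrix of the football network (built from the provided edge list, assuming no isolated nodes), and the normalization shown is D^(-1/2) A D^(-1/2), which is consistent with ranking the eigenvalues from largest to smallest as described above:

```python
import matplotlib.pyplot as plt

deg = A_fb.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
M = D_inv_sqrt @ A_fb @ D_inv_sqrt        # normalized matrix

eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]         # rank eigenvalues largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

plt.plot(eigvals, marker="o")
plt.xlabel("Rank")
plt.ylabel("Eigenvalue")
plt.show()
```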

Question 2.3

The cells below display the largest and smallest cluster sizes for k = 5, 7, 10, and 17 (my approximation)
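A sketch of how such a cell might compute the cluster sizes, continuing from the eigendecomposition above (the row normalization and variable names are my assumptions, not the exact notebook code):

```python
from sklearn.cluster import KMeans

def cluster_sizes(eigvecs, k):
    Z = eigvecs[:, :k]                                  # k leading eigenvectors as features
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # normalize each team's row
    labels = KMeans(n_clusters=k).fit_predict(Z)
    counts = np.bincount(labels, minlength=k)
    return counts.max(), counts.min()

for k in (5, 7, 10, 17):
    largest, smallest = cluster_sizes(eigvecs, k)
    print(f"k={k}: largest cluster = {largest}, smallest cluster = {smallest}")
```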

k=5

k=7

k=10

k=17

Cluster results for k = 10, listing the teams that belong to each cluster

Question 2.4

Results differ slightly between runs of k-means clustering with k=10 because of how the clusters are initialized. The default initialization strategy, 'k-means++', speeds up convergence compared to purely random initialization, but it is still randomized, so the clusters vary slightly from run to run.
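A small illustration of this variability (Z here is the 10-column eigenvector feature matrix from the sketches above; the seeds are arbitrary):

```python
# 'k-means++' is KMeans's default init and is still randomized, so different
# seeds can settle in slightly different local optima
Z = eigvecs[:, :10]
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
for seed in (0, 1, 2):
    labels = KMeans(n_clusters=10, init="k-means++", random_state=seed).fit_predict(Z)
    counts = np.bincount(labels, minlength=10)
    print(f"seed={seed}: largest = {counts.max()}, smallest = {counts.min()}")
```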

Across all of my k=10 runs, “Georgia Tech”, “Georgia State”, and “Georgia” (UGA) have always landed in the same cluster. Despite the changing cluster initialization mentioned above, these three teams must be interconnected enough that the vast majority of initializations still group them together into the same cluster.

Question 2.5

When repeatedly running k-means with k=10, the sizes of the largest and smallest clusters changed slightly on every run. I found this interesting because it was a direct illustration of the fact that k-means is a heuristic: although it is very useful, it will not find the optimal solution every time it is used.