Here the .gif images for each subject are imported, converted to matrices, and vectorized. Each raw image is 243x320, so we downsample both dimensions by a factor of 4 to 60x80 (243//4 = 60, 320//4 = 80).
# Referenced: https://stackoverflow.com/questions/26392336/importing-images-from-a-directory-python-to-list-or-dictionary
from PIL import Image
import glob
import numpy as np
import math
import matplotlib.pyplot as plt
import scipy.io as spio
import scipy.sparse.linalg as ll
import sklearn.preprocessing as skpp
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist
pictures = []
A = np.empty(60*80)
for pic in glob.glob('data/yalefaces/*.gif'):
    # import .gif file
    face = Image.open(pic)
    # reduce to lower resolution by factor of 4
    new_size = (face.size[0]//4, face.size[1]//4)
    # resize image (Image.LANCZOS was formerly named Image.ANTIALIAS, which was removed in Pillow 10)
    im_resized = face.resize(new_size, Image.LANCZOS)
    # add to list of images
    pictures.append(im_resized)
    # convert image to numpy array
    matrix = np.asarray(im_resized)
    # flatten matrix into a single vector
    vector = matrix.flatten()
    # append image vector to data set
    A = np.vstack([A, vector])
A = A[1:].copy()
# 21 total images stored as array rows
A.shape
(21, 4800)
Each row represents a lower resolution (i.e. 60x80), vectorized greyscale image.
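As an aside on the accumulation pattern above (an equivalent sketch, not a change to the analysis): np.vstack inside the loop copies A on every iteration, and the same (21, 4800) matrix can be built in one call from the pictures list.
A_alt = np.stack([np.asarray(p).flatten() for p in pictures])
A_alt.shape  # (21, 4800)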
A_1_test = A[0].copy()
A_1_train = A[1:11].copy()
A_2_test = A[11].copy()
A_2_train = A[12:].copy()
# display test image 1
pictures[0]
# display test image 2
pictures[11]
Next we find weight vectors that extract different "eigenfaces" corresponding to the first few principal components of each subject's training pictures. We do not need to normalize the features before PCA because every feature is already a greyscale pixel intensity on the same 0 to 255 scale.
# adapted from 6740 demo code
# PCA
m, n = A_1_train.shape
mu_1 = np.mean(A_1_train,axis = 1)
xc = A_1_train - mu_1[:,None]
C = np.dot(xc,xc.T)/m  # small m-by-m "snapshot" matrix; its eigenvectors map back to pixel space via W[:,k].T @ xc
# extract first 6 eigenvectors for the reduced representation of each image
K = 6
S,W = ll.eigs(C,k = K)  # k largest-magnitude eigenpairs (not guaranteed to be sorted)
S = S.real
W = W.real
dim1_1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # 1st eigenface
dim2_1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # 2nd eigenface
dim3_1 = np.dot(W[:,2].T,xc)/math.sqrt(S[2]) # 3rd eigenface
dim4_1 = np.dot(W[:,3].T,xc)/math.sqrt(S[3]) # 4th eigenface
dim5_1 = np.dot(W[:,4].T,xc)/math.sqrt(S[4]) # 5th eigenface
dim6_1 = np.dot(W[:,5].T,xc)/math.sqrt(S[5]) # 6th eigenface
# reshape eigenfaces for visualization
z1_1 = np.reshape(dim1_1, (60,80))
z2_1 = np.reshape(dim2_1, (60,80))
z3_1 = np.reshape(dim3_1, (60,80))
z4_1 = np.reshape(dim4_1, (60,80))
z5_1 = np.reshape(dim5_1, (60,80))
z6_1 = np.reshape(dim6_1, (60,80))
from skimage import io
io.imshow(z1_1)
io.imshow(z2_1)
io.imshow(z3_1)
io.imshow(z4_1)
io.imshow(z5_1)
io.imshow(z6_1)
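Optionally (a consolidation sketch, not part of the original cells), the six eigenfaces can be displayed in a single figure using the arrays already computed:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, z in zip(axes.ravel(), [z1_1, z2_1, z3_1, z4_1, z5_1, z6_1]):
    ax.imshow(z, cmap='gray')
    ax.axis('off')
plt.show()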
# adapted from 6740 demo code
# PCA
m, n = A_2_train.shape
mu_2 = np.mean(A_2_train,axis = 1)
xc = A_2_train - mu_2[:,None]
C = np.dot(xc,xc.T)/m
# extract first 6 eigenvectors for the reduced representation of each image
K = 6
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real
dim1_2 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # 1st eigenface
dim2_2 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # 2nd eigenface
dim3_2 = np.dot(W[:,2].T,xc)/math.sqrt(S[2]) # 3rd eigenface
dim4_2 = np.dot(W[:,3].T,xc)/math.sqrt(S[3]) # 4th eigenface
dim5_2 = np.dot(W[:,4].T,xc)/math.sqrt(S[4]) # 5th eigenface
dim6_2 = np.dot(W[:,5].T,xc)/math.sqrt(S[5]) # 6th eigenface
# reshape eigenfaces for visualization
z1_2 = np.reshape(dim1_2, (60,80))
z2_2 = np.reshape(dim2_2, (60,80))
z3_2 = np.reshape(dim3_2, (60,80))
z4_2 = np.reshape(dim4_2, (60,80))
z5_2 = np.reshape(dim5_2, (60,80))
z6_2 = np.reshape(dim6_2, (60,80))
io.imshow(z1_2)
io.imshow(z2_2)
io.imshow(z3_2)
io.imshow(z4_2)
io.imshow(z5_2)
io.imshow(z6_2)
There appear to be similar patterns in the first 6 eigenfaces for both subjects. The first eigenface, i.e. the image reconstructed from the eigenvector with the largest eigenvalue, appears to separate the person's face from the background. This makes sense because that is the most robust, consistent feature of each image.
From there, the eigenfaces capture progressively more detailed, less consistent features. For example, the second eigenface (for both subjects) appears to capture the shadow created by angled photos of the subject. The orientation of the face and the facial features then begin to show up in later eigenfaces.
Capture pixel means from each of the two training sets
mu_3 = np.mean(A_1_train,axis = 0)
mu_4 = np.mean(A_2_train,axis = 0)
s = np.zeros((2,2))
# rows of s index the eigenface's subject, columns index the test image's subject
# subject 1 test picture with subject 1 top eigenface
s[0][0] = np.linalg.norm((A_1_test - mu_3) - dim1_1 * np.dot(dim1_1.T, (A_1_test - mu_3)), ord=2)**2
print('Subject 1 test picture projection residual with subject 1 top eigenface:', round(s[0][0], 1))
# subject 1 test picture with subject 2 top eigenface
s[1][0] = np.linalg.norm((A_1_test - mu_4) - dim1_2 * np.dot(dim1_2.T, (A_1_test - mu_4)), ord=2)**2
print('Subject 1 test picture projection residual with subject 2 top eigenface:', round(s[1][0], 1))
# subject 2 test picture with subject 1 top eigenface
s[0][1] = np.linalg.norm((A_2_test - mu_3) - dim1_1 * np.dot(dim1_1.T, (A_2_test - mu_3)), ord=2)**2
print('Subject 2 test picture projection residual with subject 1 top eigenface:', round(s[0][1], 1))
# subject 2 test picture with subject 2 top eigenface
s[1][1] = np.linalg.norm((A_2_test - mu_4) - dim1_2 * np.dot(dim1_2.T, (A_2_test - mu_4)), ord=2)**2
print('Subject 2 test picture projection residual with subject 2 top eigenface:', round(s[1][1], 1))
s
Subject 1 test picture projection residual with subject 1 top eigenface: 67065784.6
Subject 1 test picture projection residual with subject 2 top eigenface: 1135327013.2
Subject 2 test picture projection residual with subject 1 top eigenface: 797562589.0
Subject 2 test picture projection residual with subject 2 top eigenface: 6292442.8
array([[6.70657846e+07, 7.97562589e+08],
       [1.13532701e+09, 6.29244277e+06]])
The projection residuals computed above measure the distance between each centered test image and its projection onto a subject's top eigenface: with eigenface $v$ and training pixel-mean $\mu$, the score is $s = \lVert (x - \mu) - v\,v^{T}(x - \mu) \rVert_2^2$. A smaller projection residual indicates greater similarity between a subject's eigenface and the test image.
Looking at the projection residual matrix s above, when a test image is matched with the correct eigenface, the projection residual is at least an order of magnitude smaller than the residuals in the mismatched cases (e.g., subject 2's test image with the top eigenface of subject 1).
Using this knowledge, a binary classification threshold on the projection residual could be established to determine whether a test image is likely a match for the eigenface it is being compared against.
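To make the decision rule concrete, here is a minimal nearest-subspace classifier sketch (hypothetical, built on the residual matrix s above): assign each test image to the subject whose top eigenface yields the smallest residual.
# rows of s index the eigenface's subject, columns index the test image
pred_test_1 = np.argmin(s[:, 0]) + 1  # -> 1 (test image 1 matched to subject 1)
pred_test_2 = np.argmin(s[:, 1]) + 1  # -> 2 (test image 2 matched to subject 2)
print(pred_test_1, pred_test_2)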
I do think this facial recognition strategy can work well. Given the opportunity to improve this system, I would look into making use of more than just the top eigenface for each class.
The top eigenvectors could be selected using an elbow diagram of the absolute values of their eigenvalues. Then, after weighting each eigenface by the magnitude of its corresponding eigenvalue, the multiple eigenfaces could be blended into a more holistic representation of the face. This blended eigenface could then be compared to test images using the projection residual calculation and a classification threshold.
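A related, more standard alternative to blending (a sketch, assuming the rows of V have been orthonormalized; note the scaled dim vectors above are not unit-norm) is to measure the residual against the span of the top-K eigenfaces instead of a single direction:
def subspace_residual(test_vec, mu, V):
    # squared distance from the centered test image to its projection
    # onto the subspace spanned by the orthonormal rows of V
    xc = test_vec - mu
    proj = V.T @ (V @ xc)
    return np.linalg.norm(xc - proj, ord=2)**2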
$\epsilon$-ISOMAP (Isometric Feature Mapping) is applied to achieve nonlinear dimensionality reduction
Import images for ISOMAP
import scipy.io
mat = scipy.io.loadmat('data/isomap.mat')
mat['images'].shape
(4096, 698)
Create data matrix that contains 698 vectorized images as rows
x = mat['images'].T.copy()
x.shape
(698, 4096)
Step 1) Create the adjacency matrix A (i.e. the weighted $\epsilon$-nearest-neighbor graph). Without the $\epsilon$ threshold, the mean pairwise Euclidean distance between images is about 21.
A = np.zeros((x.shape[0], x.shape[0]))
epsilon = 13
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        distance = np.linalg.norm(x[i] - x[j], ord=2)
        if distance <= epsilon:
            A[i][j] = distance
        else:
            A[i][j] = 0
A
array([[0.        , 0.        , 6.74323967, ..., 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
       [6.74323967, 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ]])
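For reference, the same graph can be built without the double loop using the distance_matrix import above (an equivalent sketch, up to floating point); this also checks the mean-distance claim:
D_full = distance_matrix(x, x)  # all pairwise Euclidean distances
print(D_full.mean())            # roughly 21 without thresholding
A_vec = np.where(D_full <= epsilon, D_full, 0.0)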
import networkx as nx
import numpy as np
G = nx.from_numpy_matrix(A)  # note: nx.from_numpy_array in networkx >= 3.0
#nx.draw(G, with_labels=False)
nx.draw_networkx(G, node_size=4, width=0.1, with_labels=True, alpha=0.5, node_color ='red')
plt.show()
The following two images are located on the bottom left of the adjacency matrix graph and are generally facing to the right:
plt.imshow(x[613].reshape((64,64), order='F'))
plt.imshow(x[506].reshape((64,64), order='F'))
The following three images are located on the right of the adjacency matrix graph and are generally facing to the left:
plt.imshow(x[143].reshape((64,64), order='F'))
plt.imshow(x[538].reshape((64,64), order='F'))
plt.imshow(x[694].reshape((64,64), order='F'))
Step 2) Using the weighted graph A, compute the pairwise shortest distance matrix D
Given a weighted adjacency matrix (m x m), this returns the shortest-path distance matrix D (m x m), where D[i, j] is the length of the shortest path from image i to image j along the graph, computed with the Floyd-Warshall algorithm.
A_new = np.zeros((x.shape[0], x.shape[0]))
epsilon = 12
for i in range(A_new.shape[0]):
    for j in range(A_new.shape[1]):
        # Euclidean distance
        distance = np.linalg.norm(x[i] - x[j], ord=2)
        if distance <= epsilon:
            A_new[i][j] = distance
        else:
            A_new[i][j] = 0
# check to see how many neighbors are retained for the first image
np.count_nonzero(A_new[0])
48
import sklearn
from sklearn.utils import graph_shortest_path
# referenced https://scikit-learn.org/stable/modules/...
# generated/sklearn.utils.graph_shortest_path.graph_shortest_path.html
D = graph_shortest_path.graph_shortest_path(A_new)
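Note that sklearn.utils.graph_shortest_path has since been removed from scikit-learn; scipy.sparse.csgraph.shortest_path is an equivalent replacement (a sketch; zero entries of the dense matrix are treated as missing edges):
from scipy.sparse.csgraph import shortest_path
D_alt = shortest_path(A_new, method='FW', directed=False)  # Floyd-Warshall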
Step 3) Use the centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ to compute $C = -\frac{1}{2} H D^{(2)} H$, where $D^{(2)}$ is the elementwise square of $D$
m = D.shape[0]
H = np.eye(m) - (1/m)*(np.ones((m,m)))  # centering matrix H = I - (1/m)*11^T
C = (-1/2) * np.dot(np.dot(H, D**2), H)  # C = -(1/2) H D^(2) H
Step 4) Compute the leading two eigenvectors $w_1$, $w_2$ and eigenvalues $\lambda_1$, $\lambda_2$ of C
from scipy.linalg import eig
from scipy.sparse.linalg import eigs
eig_val, eig_vec = eig(C)
eig_val = eig_val.real
eig_vec = eig_vec.real
idx_sorted = np.argsort(eig_val)[::-1] # indices of eigenvalues, sorted descending
val = eig_val[idx_sorted][0:2]         # top-2 eigenvalues
vec = eig_vec[:, idx_sorted[0:2]]      # matching eigenvectors, as columns
Z stores the two-dimensional embedding, $Z = [w_1\ w_2]\,\mathrm{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2})$
Z = np.dot(vec, np.diag(np.sqrt(val)))
Z
array([[ 1.94219139e+01, -2.18810514e+00],
       [-2.10522575e+01,  1.80567930e-02],
       [ 1.93611559e+01, -6.87453411e+00],
       ...,
       [-9.70006432e+00,  1.84205562e+01],
       [-2.68295210e+01, -7.53770948e+00],
       [ 1.05449547e+01,  6.87418680e+00]])
Scatterplot showing the overall 2D pattern similar to the lecture slides
plt.scatter(Z[:,0],Z[:,1])
# plotting code adapted from this blog:
# https://benalexkeen.com/isomap-for-dimensionality-reduction-in-python/
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import pandas as pd
# generate DataFrame from lower-dimensional representation
manifold = pd.DataFrame(Z, columns=['Component 1', 'Component 2'])
# generate DataFrame from original dataset to layer images on plot
df = pd.DataFrame(x)
num_images, num_pixels = df.shape
pixels_per_dimension = int(math.sqrt(num_pixels))
# Rotate images
for idx in df.index:
    df.loc[idx] = df.loc[idx].values.reshape(pixels_per_dimension, pixels_per_dimension).T.reshape(-1)
# display two-dimensional representation
fig = plt.figure()
fig.set_size_inches(10, 10)
ax = fig.add_subplot(111)
ax.set_title('2D Components from Isomap of Facial Images')
ax.set_xlabel('Component: 1')
ax.set_ylabel('Component: 2')
# layer 20 images on the plot
x_size = (max(manifold['Component 1']) - min(manifold['Component 1'])) * 0.08
y_size = (max(manifold['Component 2']) - min(manifold['Component 2'])) * 0.08
for i in range(20):
    img_num = np.random.randint(0, 698)
    x0 = manifold.loc[img_num, 'Component 1'] - (x_size / 2.)
    y0 = manifold.loc[img_num, 'Component 2'] - (y_size / 2.)
    x1 = manifold.loc[img_num, 'Component 1'] + (x_size / 2.)
    y1 = manifold.loc[img_num, 'Component 2'] + (y_size / 2.)
    img = df.iloc[img_num,:].values.reshape(pixels_per_dimension, pixels_per_dimension)
    ax.imshow(img, aspect='auto', cmap=plt.cm.gray,
              interpolation='nearest', zorder=100000, extent=(x0, x1, y0, y1))
# Show 2D components plot
ax.scatter(manifold['Component 1'], manifold['Component 2'], marker='.',alpha=0.7)
ax.set_ylabel('Up-Down Pose')
ax.set_xlabel('Right-Left Pose')
plt.show()
The scatterplot above shows the embedding space with 20 images overlaid on various data points. The overall pattern of the two-dimensional plot generally matches the results from the paper and the lecture slides when using epsilon = 12 (i.e. the distance threshold for the weighted nearest neighbor graph).
One key difference is that the x-axis in my visual displays the faces in right-left pose order, whereas the paper displays left-right poses on the x-axis. However, like the paper and the lecture slides, the y-axis displays the images by up-down pose (i.e. faces at the bottom are looking down and faces at the top are looking up).
When using Euclidean distance, $\epsilon$=12 resulted in 48 neighbors for the first image. Therefore, when choosing epsilon for Manhattan distance, I start by trying to achieve a similar number of neighbors for the first image; the threshold used below is $\epsilon$ = 515.
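As a rough way to search for such a threshold (the helper below is hypothetical, not from the original analysis), count the first image's neighbors at candidate values of epsilon:
# number of images within eps of image 0 under the L1 (Manhattan) norm
def neighbors_of_first(eps):
    d = np.linalg.norm(x - x[0], ord=1, axis=1)
    return int(np.sum((d > 0) & (d <= eps)))
for eps in (490, 515, 540):
    print(eps, neighbors_of_first(eps))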
A_new = np.zeros((x.shape[0], x.shape[0]))
epsilon = 515
for i in range(A_new.shape[0]):
    for j in range(A_new.shape[1]):
        # Manhattan distance
        distance = np.linalg.norm(x[i] - x[j], ord=1)
        if distance <= epsilon:
            A_new[i][j] = distance
        else:
            A_new[i][j] = 0
# check to see how many neighbors are retained for the first image
np.count_nonzero(A_new[0])
49
# referenced https://scikit-learn.org/stable/modules/...
#generated/sklearn.utils.graph_shortest_path.graph_shortest_path.html
D = graph_shortest_path.graph_shortest_path(A_new)
m = D.shape[0]
H = np.eye(m) - (1/m)*(np.ones((m,m)))
C = (-1/2) * np.dot(np.dot(H, D**2), H)
eig_val, eig_vec = eig(C)
eig_val = eig_val.real
eig_vec = eig_vec.real
idx_sorted = np.argsort(eig_val)[::-1] # indices of eigenvalues, sorted descending
val = eig_val[idx_sorted][0:2]         # top-2 eigenvalues
vec = eig_vec[:, idx_sorted[0:2]]      # matching eigenvectors, as columns
Z = np.dot(vec, np.diag(np.sqrt(val)))
# plotting code adapted from this blog:
# https://benalexkeen.com/isomap-for-dimensionality-reduction-in-python/
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import pandas as pd
# generate DataFrame from lower-dimensional representation
manifold = pd.DataFrame(Z, columns=['Component 1', 'Component 2'])
# generate DataFrame from original dataset to layer images on plot
df = pd.DataFrame(x)
num_images, num_pixels = df.shape
pixels_per_dimension = int(math.sqrt(num_pixels))
# Rotate images
for idx in df.index:
    df.loc[idx] = df.loc[idx].values.reshape(pixels_per_dimension, pixels_per_dimension).T.reshape(-1)
# display two-dimensional representation
fig = plt.figure()
fig.set_size_inches(10, 10)
ax = fig.add_subplot(111)
ax.set_title('2D Components from Isomap of Facial Images')
ax.set_xlabel('Component: 1')
ax.set_ylabel('Component: 2')
# layer 20 images on the plot
x_size = (max(manifold['Component 1']) - min(manifold['Component 1'])) * 0.08
y_size = (max(manifold['Component 2']) - min(manifold['Component 2'])) * 0.08
for i in range(20):
    img_num = np.random.randint(0, 698)
    x0 = manifold.loc[img_num, 'Component 1'] - (x_size / 2.)
    y0 = manifold.loc[img_num, 'Component 2'] - (y_size / 2.)
    x1 = manifold.loc[img_num, 'Component 1'] + (x_size / 2.)
    y1 = manifold.loc[img_num, 'Component 2'] + (y_size / 2.)
    img = df.iloc[img_num,:].values.reshape(pixels_per_dimension, pixels_per_dimension)
    ax.imshow(img, aspect='auto', cmap=plt.cm.gray,
              interpolation='nearest', zorder=100000, extent=(x0, x1, y0, y1))
# Show 2D components plot
ax.scatter(manifold['Component 1'], manifold['Component 2'], marker='.',alpha=0.7)
ax.set_ylabel('Down-Up Pose')
ax.set_xlabel('Left-Right Pose')
plt.show()
A similar pattern was achieved by using $\epsilon$=515 when calculating neighbor distance with the Manhattan norm. Because of the new distance norm, the neighbor threshold had to be recalculated; the starting point for this recalculation was matching the number of nearest neighbors found for the first image under Euclidean distance.
The overall two-dimensional embedding looks very similar to the one obtained with Euclidean distance. However, the orientation of the faces is now flipped: the y-axis runs down-up (it was up-down before) and the x-axis runs left-right (it was right-left before).
# adapted from 6740 demo code
# PCA (fix: transpose so images are columns, giving one score per image;
# the untransposed x would yield length-4096 projections for 698 images)
X = x.T
m, n = X.shape
mu_1 = np.mean(X,axis = 1)  # per-pixel mean image
xc = X - mu_1[:,None]
C = np.dot(xc,xc.T)/m
# extract first 2 eigenvectors
K = 2
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real
dim1_1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # scores along 1st principal component
dim2_1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # scores along 2nd principal component
# extract the top 2 principal components
Z_PCA = pd.DataFrame((dim1_1, dim2_1)).transpose()
Z_PCA.columns = ['Component 1', 'Component 2']
# plotting code adapted from this blog:
# https://benalexkeen.com/isomap-for-dimensionality-reduction-in-python/
# generate DataFrame from original dataset to layer images on plot
df = pd.DataFrame(x)
num_images, num_pixels = df.shape
pixels_per_dimension = int(math.sqrt(num_pixels))
# Rotate images
for idx in df.index:
    df.loc[idx] = df.loc[idx].values.reshape(pixels_per_dimension, pixels_per_dimension).T.reshape(-1)
# display two-dimensional representation
fig = plt.figure()
fig.set_size_inches(10, 10)
ax = fig.add_subplot(111)
ax.set_title('2D Components from PCA of Facial Images')
ax.set_xlabel('Component: 1')
ax.set_ylabel('Component: 2')
# layer 20 images on the plot
x_size = (max(Z_PCA['Component 1']) - min(Z_PCA['Component 1'])) * 0.08
y_size = (max(Z_PCA['Component 2']) - min(Z_PCA['Component 2'])) * 0.08
for i in range(20):
    img_num = np.random.randint(0, 698)
    x0 = Z_PCA.loc[img_num, 'Component 1'] - (x_size / 2.)
    y0 = Z_PCA.loc[img_num, 'Component 2'] - (y_size / 2.)
    x1 = Z_PCA.loc[img_num, 'Component 1'] + (x_size / 2.)
    y1 = Z_PCA.loc[img_num, 'Component 2'] + (y_size / 2.)
    img = df.iloc[img_num,:].values.reshape(pixels_per_dimension, pixels_per_dimension)
    ax.imshow(img, aspect='auto', cmap=plt.cm.gray,
              interpolation='nearest', zorder=100000, extent=(x0, x1, y0, y1))
# Show 2D components plot
ax.scatter(Z_PCA['Component 1'], Z_PCA['Component 2'], marker='.',alpha=0.7)
ax.set_ylabel('Down-Up Pose')
ax.set_xlabel('Left-Right Pose')
plt.show()
Using the implementation of PCA written in Question 1, I projected the images onto the top two principal components. Clearly, the ISOMAP embedding is much more meaningful than the PCA projection: PCA cannot capture the manifold structure of the data, so the left-right and up-down pose trends are far weaker in its two components.
Import food consumption data
df = pd.read_csv('data/food-consumption.csv')
Create numpy array from DataFrame
country_array = df.drop('Country', axis=1).to_numpy()
country_array = country_array.T
The data matrix for this part is set up so that each column corresponds to a country and each row corresponds to a food. The country name column was removed before converting the dataframe to a matrix.
Performing PCA and extracting top 2 eigenvalue/eigenvector pairs
# adapted from 6740 demo code
# PCA
m, n = country_array.shape
mu_1 = np.mean(country_array,axis = 1)
xc = country_array - mu_1[:,None]
C = np.dot(xc,xc.T)/m
# extract first 2 eigenvectors
K = 2
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real
dim1_1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # country scores along 1st principal component
dim2_1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # country scores along 2nd principal component
# adapted from: https://stackoverflow.com/questions/14432557/matplotlib-scatter-plot-with-different-text-at-each-data-point
z1 = dim1_1
z2 = dim2_1
n = list(df['Country'])
fig, ax = plt.subplots()
ax.scatter(z1, z2)
for i, txt in enumerate(n):
    ax.annotate(txt, (z1[i], z2[i]))
There appear to be some regional groupings in primary food consumption. Northern European countries cluster in the bottom right, while southwestern European countries cluster on the left, suggesting that these regional groupings tend to consume similar foods. Interestingly, England is plotted relatively far from any other country, possibly meaning that food consumption there differs significantly from that in other countries.
The data matrix for this part is set up so that each row corresponds to a country and each column corresponds to a food. The country name column was removed before converting the dataframe to a matrix.
country_array = df.drop('Country', axis=1).to_numpy()
# adapted from 6740 demo code
# PCA
m, n = country_array.shape
mu_1 = np.mean(country_array,axis = 1)
xc = country_array - mu_1[:,None]
C = np.dot(xc,xc.T)/m
# extract first 2 eigenvectors
K = 2
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real
dim1_1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # food scores along 1st principal component
dim2_1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # food scores along 2nd principal component
# adapted from: https://stackoverflow.com/questions/14432557/matplotlib-scatter-plot-with-different-text-at-each-data-point
z1 = dim1_1
z2 = dim2_1
n = list(df.columns)[1:]
fig, ax = plt.subplots()
ax.scatter(z1, z2)
for i, txt in enumerate(n):
    ax.annotate(txt, (z1[i], z2[i]))
Along the 1st principal component (the x-axis), garlic and olive oil show fairly average consumption across countries, whereas real coffee and potatoes sit at the two extremes. Along the 2nd principal component, however, garlic and olive oil have markedly lower values than the rest of the foods. This points to garlic and olive oil being foods whose consumption distinguishes different countries' eating habits.