
I'm trying to do a simple principal component analysis with matplotlib.mlab.PCA, but the attributes of that class don't give me a clean solution to my problem. Here's an example:

Get some dummy data in 2D and start PCA:

from matplotlib.mlab import PCA
import numpy as np

N     = 1000
xTrue = np.linspace(0,1000,N)
yTrue = 3*xTrue

xData = xTrue + np.random.normal(0, 100, N)
yData = yTrue + np.random.normal(0, 100, N)
xData = np.reshape(xData, (N, 1))
yData = np.reshape(yData, (N, 1))
data  = np.hstack((xData, yData))
test2PCA = PCA(data)

Now, I just want to get the principal components as vectors in my original coordinates and plot them as arrows onto my data.

What is a quick and clean way to get there?

Thanks, Tyrax


2 Answers


I don't think the mlab.PCA class is appropriate for what you want to do. In particular, the PCA class rescales the data before finding the eigenvectors:

a = self.center(a)
U, s, Vh = np.linalg.svd(a, full_matrices=False)

The center method divides by sigma:

def center(self, x):
    'center the data using the mean and sigma from training set a'
    return (x - self.mu)/self.sigma

This results in eigenvectors, pca.Wt, like this:

[[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

They are perpendicular, but they are not the principal axes of your original data; they are the principal axes of the standardized (mean-centered, sigma-scaled) data.
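To see where those ±0.707 values come from: once each column has been divided by its standard deviation, the covariance matrix of the standardized data is just the correlation matrix, and for 2-D data its eigenvectors always lie along the diagonals (as long as the two variables are correlated at all). A quick check with plain numpy (a sketch, reusing the data array from the question) reproduces them up to sign:

import numpy as np

# standardize the columns the same way mlab.PCA's center() does
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
U, s, Wt = np.linalg.svd(standardized, full_matrices=False)
print(Wt)  # entries are all approximately +-0.70710678, like pca.Wt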

It might be easier to code what you want directly, without using the mlab.PCA class:

import numpy as np
import matplotlib.pyplot as plt

N = 1000
xTrue = np.linspace(0, 1000, N)
yTrue = 3 * xTrue
xData = xTrue + np.random.normal(0, 100, N)
yData = yTrue + np.random.normal(0, 100, N)
xData = np.reshape(xData, (N, 1))
yData = np.reshape(yData, (N, 1))
data = np.hstack((xData, yData))

mu = data.mean(axis=0)
data = data - mu
# data = (data - mu)/data.std(axis=0)  # Uncommenting this reproduces mlab.PCA results
eigenvectors, eigenvalues, V = np.linalg.svd(data.T, full_matrices=False)
projected_data = np.dot(data, eigenvectors)
sigma = projected_data.std(axis=0).mean()
print(eigenvectors)

fig, ax = plt.subplots()
ax.scatter(xData, yData)
for axis in eigenvectors:
    start, end = mu, mu + sigma * axis
    ax.annotate(
        '', xy=end, xycoords='data',
        xytext=start, textcoords='data',
        arrowprops=dict(facecolor='red', width=2.0))
ax.set_aspect('equal')
plt.show()

[resulting plot: the noisy data as a scatter, with the two principal axes drawn as red arrows starting at the mean]
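If you also want to know how much of the variance falls along each principal axis (the arrows above use the single averaged sigma, so they come out equally long), it can be read straight off the projected data. A small follow-up sketch, not part of the original code:

variances = projected_data.var(axis=0)  # variance along each principal axis, largest first
print(variances)
print(variances / variances.sum())      # fraction of total variance explained by each axis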

  • What is the meaning of the 1.618 constant? Where does it come from? – joaquin Aug 18 '13 at 15:13
  • @joaquin: It's approximately the [golden ratio](http://en.wikipedia.org/wiki/Golden_ratio). You can, of course, choose any constant you like, but it [often looks good](http://en.wikipedia.org/wiki/Golden_ratio#Painting). – unutbu Aug 18 '13 at 15:44
  • @unutbu: The two vectors are not orthogonal, something must be wrong here. – Tyrax Aug 19 '13 at 11:07
  • Thanks, that helps a lot. I was wondering why pca.Wt had those strange values. I'm still surprised that the pca-class is not really usable for such a simple pca task. The documentation at http://matplotlib.org/api/mlab_api.html#matplotlib.mlab.PCA is unusable imho. There's a better one here: https://www.clear.rice.edu/comp130/12spring/pca/pca_docs.shtml but I was still not able to really understand what's going on. – Tyrax Aug 19 '13 at 14:05
  • Is it coincidence that both vectors are of the same length? Because they are also in my plot. The eigenvalues tell how long they should be, right? – Ben Feb 27 '20 at 15:24
  • Doesn't `numpy.linalg.svd()` return the "singular values" (`s`) instead of the `eigenvalues`? As explained e.g. [here](https://stats.stackexchange.com/a/134283): `eigenvalue=s**2/(n-1)` (for the eigenvalues of the covariance matrix). From the [svd docs](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html?highlight=svd#numpy.linalg.svd): "... the corresponding (possibly non-zero) eigenvalues are given by `s**2`" – djvg Jun 24 '21 at 19:29

Note that matplotlib.mlab.PCA was removed in matplotlib 3.1.

Below are three alternative PCA implementations: one based on the last matplotlib.mlab.PCA implementation, one based on unutbu's answer, and one based on doug's answer to another question.

The first two use singular value decomposition (svd) to obtain the eigenvalues and eigenvectors; the third uses a covariance-matrix (cov) approach.

A magnificent explanation of the relation between the svd and cov approaches is found here.
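In short (summarizing that relation): for mean-centered data `X` with `n` rows, the covariance matrix is `C = X.T @ X / (n - 1)`. Substituting the svd `X = U @ numpy.diag(s) @ Vt` gives `C = Vt.T @ numpy.diag(s**2 / (n - 1)) @ Vt`, so the rows of `Vt` are the eigenvectors of `C` and its eigenvalues are `s**2 / (n - 1)`, which is exactly the conversion applied in the code below.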

The implementations have been simplified and refactored for easy comparison.

import numpy
import scipy.linalg


def pca_svd(data):
    """ based on matplotlib.mlab.PCA with standardize=False """
    data -= data.mean(axis=0)
    __, singular_values, eigenvectors_transposed = numpy.linalg.svd(
        data, full_matrices=False)
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)
    eigenvectors = eigenvectors_transposed.T
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors


def pca_svd_transposed(data):
    """ based on unutbu's answer """
    data -= data.mean(axis=0)
    eigenvectors, singular_values, __ = numpy.linalg.svd(
        data.T, full_matrices=False)  # note data transposed
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors
    
    
def pca_cov(data):
    """ based on doug's answer """
    data -= data.mean(axis=0)
    covariance_matrix = numpy.cov(data, rowvar=False)
    eigenvalues, eigenvectors = scipy.linalg.eigh(covariance_matrix)
    decreasing_order = numpy.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[decreasing_order]
    eigenvectors = eigenvectors[:, decreasing_order]
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors

The eigenvalues represent the variance of the data along the principal axes, i.e. the variance of the transformed_data.
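A quick way to check that claim, and that the three implementations agree on the eigenvalues (a sketch with made-up test data; note that eigenvector signs may differ between the svd and eigh routines):

import numpy

numpy.random.seed(0)
test_data = numpy.random.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=10000)

for pca in (pca_svd, pca_svd_transposed, pca_cov):
    # pass a copy because the functions center the array in place
    transformed, eigenvalues, __ = pca(test_data.copy())
    # the eigenvalues should equal the variance of the transformed data
    print(pca.__name__, eigenvalues, transformed.var(axis=0, ddof=1))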

Timing using timeit reveals the following on my system:

array shape:  (15000, 4)
iterations:  1000
pca_svd_transposed: 4.32 s (average 4.32 ms)
pca_svd:            1.87 s (average 1.87 ms)
pca_cov:            1.41 s (average 1.41 ms)

Note that the svd of the transposed input array is relatively slow for this array shape.
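For reference, a minimal timeit harness along these lines reproduces that kind of output (a sketch; the array shape and iteration count are taken from the listing above):

import timeit

import numpy

data = numpy.random.normal(size=(15000, 4))
iterations = 1000

for pca in (pca_svd_transposed, pca_svd, pca_cov):
    # pass a copy on each call because the functions center the array in place
    total = timeit.timeit(lambda: pca(data.copy()), number=iterations)
    print('{}: {:.2f} s (average {:.2f} ms)'.format(
        pca.__name__, total, 1000 * total / iterations))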
