Questions tagged [pca]

Principal component analysis (PCA) is a statistical technique for dimension reduction, often used in clustering or factor analysis. Given any number of explanatory or causal variables, PCA ranks the variables by their ability to explain the greatest variation in the data. It is this property that allows PCA to be used for dimension reduction, i.e. to identify the most important variables from amongst a large set of possible influences.

Overview

Mathematically, principal component analysis (PCA) amounts to an orthogonal transformation of possibly correlated variables (vectors) into uncorrelated variables called principal component vectors.
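
A minimal NumPy sketch of that statement, using toy data chosen purely for illustration: after centring, rotating the data onto the right singular vectors gives component scores whose covariance matrix is diagonal, i.e. the new variables are uncorrelated.

    import numpy as np

    rng = np.random.default_rng(0)
    # three correlated variables (arbitrary mixing matrix, for illustration only)
    X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.5]])

    Xc = X - X.mean(axis=0)                     # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                          # orthogonal transformation -> principal components

    # off-diagonal covariances of the scores are ~0: the components are uncorrelated
    print(np.round(np.cov(scores, rowvar=False), 6))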

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.

In R, the functions princomp and prcomp compute PCA.

2728 questions
115
votes
11 answers

Principal component analysis in Python

I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh? I don't just want to use singular value decomposition (SVD) because my input…
Vebjorn Ljosa
  • 17,438
  • 13
  • 70
  • 88
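
Neither numpy nor scipy ships a dedicated PCA routine, so the choice is between a few lines around numpy.linalg.eigh (as the question suggests) or a library such as scikit-learn's sklearn.decomposition.PCA. A rough sketch of the hand-rolled route (the function name and toy data are for illustration only):

    import numpy as np

    def pca_eigh(X, n_components):
        """PCA of the rows of X via an eigendecomposition of the covariance matrix."""
        Xc = X - X.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
        return Xc @ eigvecs[:, order], eigvecs[:, order], eigvals[order]

    X = np.random.rand(100, 5)                  # toy data
    scores, components, variances = pca_eigh(X, 2)
    print(scores.shape)                         # (100, 2)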
100
votes
5 answers

Recovering features names of explained_variance_ratio_ in PCA with sklearn

I'm trying to recover from a PCA done with scikit-learn, which features are selected as relevant. A classic example with IRIS dataset. import pandas as pd import pylab as pl from sklearn import datasets from sklearn.decomposition import PCA # load…
sereizam
  • 2,048
  • 3
  • 20
  • 29
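
Assuming a PCA fitted on the iris data with scikit-learn, one way (a sketch, not the only one) to see which original features drive each component is to label pca.components_ with the feature names:

    import pandas as pd
    from sklearn import datasets
    from sklearn.decomposition import PCA

    iris = datasets.load_iris()
    pca = PCA(n_components=2).fit(iris.data)

    # rows = components, columns = original feature names
    weights = pd.DataFrame(pca.components_,
                           columns=iris.feature_names,
                           index=['PC1', 'PC2'])
    print(pca.explained_variance_ratio_)        # variance explained by each component
    print(weights)                              # weight of each feature in each component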
80
votes
11 answers

Principal Component Analysis (PCA) in Python

I have a (26424 x 144) array and I want to perform PCA over it using Python. However, there is no particular place on the web that explains how to achieve this task (there are some sites which just do PCA according to their own - there is no…
khan
  • 7,005
  • 15
  • 48
  • 70
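
A minimal scikit-learn sketch for an array of that shape (the data and the choice of 10 components below are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(26424, 144)              # stand-in for the real (26424 x 144) array

    pca = PCA(n_components=10)                  # arbitrary number of components to keep
    X_reduced = pca.fit_transform(X)            # project each row onto those components

    print(X_reduced.shape)                      # (26424, 10)
    print(pca.explained_variance_ratio_.sum())  # fraction of the variance retained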
75
votes
3 answers

Feature/Variable importance after a PCA analysis

I have performed a PCA analysis over my original dataset and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling with the…
fbm
  • 753
  • 1
  • 6
  • 5
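
There is no single agreed-upon definition of feature importance after PCA; one common heuristic (shown here only as a sketch, with placeholder data) weights the absolute loadings of each feature by the variance explained by each kept component and sums them:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(500, 8)                  # placeholder data
    pca = PCA(n_components=4).fit(X)            # e.g. enough components for ~94% of the variance

    # per-feature score: |weight| in each component, weighted by that component's variance share
    importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    ranking = np.argsort(importance)[::-1]      # original feature indices, highest score first
    print(ranking)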
68
votes
2 answers

Principal components analysis using pandas dataframe

How can I calculate Principal Components Analysis from data in a pandas dataframe?
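
A short sketch, assuming the DataFrame holds numeric columns only: run scikit-learn's PCA on the values and wrap the scores back into a DataFrame:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'b', 'c', 'd'])  # placeholder data

    pca = PCA(n_components=2)
    scores = pca.fit_transform(df.values)       # PCA centres the data internally

    df_pca = pd.DataFrame(scores, columns=['PC1', 'PC2'], index=df.index)
    print(df_pca.head())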
59
votes
3 answers

Obtain eigen values and vectors from sklearn PCA

How can I get the eigenvalues and eigenvectors of the PCA application? from sklearn.decomposition import PCA clf=PCA(0.98,whiten=True) #converse 98% variance X_train=clf.fit_transform(X_train) X_test=clf.transform(X_test) I can't find…
Abhishek Bhatia
  • 9,404
  • 26
  • 87
  • 142
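
After fitting, scikit-learn exposes the eigenvectors of the covariance matrix as pca.components_ (one per row) and the corresponding eigenvalues as pca.explained_variance_. A brief sketch with placeholder data:

    import numpy as np
    from sklearn.decomposition import PCA

    X_train = np.random.rand(200, 10)           # placeholder for the real training data

    pca = PCA(0.98, whiten=True)                # keep enough components for 98% of the variance
    X_train_pca = pca.fit_transform(X_train)

    eigenvectors = pca.components_              # shape (n_components, n_features)
    eigenvalues = pca.explained_variance_       # eigenvalues of the covariance matrix
    print(eigenvalues)
    print(eigenvectors.shape)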
48
votes
11 answers

raise LinAlgError("SVD did not converge") LinAlgError: SVD did not converge in matplotlib pca determination

Code: import numpy from matplotlib.mlab import PCA file_name = "store1_pca_matrix.txt" ori_data = numpy.loadtxt(file_name,dtype='float', comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False,…
user 3317704
  • 925
  • 2
  • 10
  • 21
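
"SVD did not converge" is very often caused by NaN or infinite entries in the input rather than by the PCA step itself. A quick sketch of checking and cleaning the data first; since matplotlib.mlab.PCA has been removed from recent matplotlib releases, scikit-learn is used here instead:

    import numpy as np
    from sklearn.decomposition import PCA

    ori_data = np.loadtxt("store1_pca_matrix.txt", dtype=float, comments='#')

    print(np.isfinite(ori_data).all())                    # False if any NaN/Inf is present
    clean = ori_data[np.isfinite(ori_data).all(axis=1)]   # drop rows containing NaN/Inf

    pca = PCA(n_components=2)
    scores = pca.fit_transform(clean)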
46
votes
3 answers

Python scikit learn pca.explained_variance_ratio_ cutoff

When choosing the number of principal components (k), we choose k to be the smallest value so that, for example, 99% of the variance is retained. However, in the Python scikit-learn, I am not 100% sure pca.explained_variance_ratio_ = 0.99 is equal to…
Chubaka
  • 2,933
  • 7
  • 43
  • 58
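
A sketch of both readings, on placeholder data: inspect the cumulative ratio explicitly, or pass the target fraction as n_components and let scikit-learn keep the smallest k that reaches it:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(300, 20)                 # placeholder data

    # explicit: smallest k whose cumulative explained variance ratio reaches 99%
    full = PCA().fit(X)
    cumulative = np.cumsum(full.explained_variance_ratio_)
    k = int(np.argmax(cumulative >= 0.99)) + 1
    print(k)

    # shortcut: a float n_components selects components until that fraction is explained
    pca = PCA(n_components=0.99)
    print(pca.fit_transform(X).shape[1])        # typically the same k as above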
46
votes
5 answers

Selecting multiple odd or even columns/rows for dataframe

Is there a way in R to select many non-consecutive i.e. odd or even rows/columns? I'm plotting the loadings for my Principal Components Analysis. I have 84 rows of data ordered like this: x_1 y_1 x_2..... x_42 y_42 And at the moment I am creating…
dmt
  • 2,113
  • 3
  • 24
  • 23
42
votes
8 answers

Plotting pca biplot with ggplot2

I wonder if it is possible to plot pca biplot results with ggplot2. Suppose if I want to display the following biplot results with ggplot2 fit <- princomp(USArrests, cor=TRUE) summary(fit) biplot(fit) Any help will be highly appreciated. Thanks
MYaseen208
  • 22,666
  • 37
  • 165
  • 309
34
votes
2 answers

PCA on sklearn - how to interpret pca.components_

I ran PCA on a data frame with 10 features using this simple code: pca = PCA() fit = pca.fit(dfPca) The result of pca.explained_variance_ratio_ shows: array([ 5.01173322e-01, 2.98421951e-01, 1.00968655e-01, 4.28813755e-02, …
Diego
  • 34,802
  • 21
  • 91
  • 134
32
votes
2 answers

R function prcomp fails with NA's values even though NA's are allowed

I am using the function prcomp to calculate the first two principal components. However, my data has some NA values and therefore the function throws an error. The na.action defined seems not to work even though it is mentioned in the help file…
user969113
  • 2,349
  • 10
  • 44
  • 51
31
votes
2 answers

PCA projection and reconstruction in scikit-learn

I can perform PCA in scikit by code below: X_train has 279180 rows and 104 columns. from sklearn.decomposition import PCA pca = PCA(n_components=30) X_train_pca = pca.fit_transform(X_train) Now, when I want to project the eigenvectors onto feature…
HonzaB
  • 7,065
  • 6
  • 31
  • 42
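
For the reconstruction half, a sketch with placeholder data: pca.inverse_transform maps the 30-dimensional scores back into the original 104-dimensional feature space, and the difference from the input is the reconstruction error:

    import numpy as np
    from sklearn.decomposition import PCA

    X_train = np.random.rand(1000, 104)                    # placeholder for the 279180 x 104 data

    pca = PCA(n_components=30)
    X_train_pca = pca.fit_transform(X_train)               # project onto the first 30 components

    X_reconstructed = pca.inverse_transform(X_train_pca)   # back to the 104 original features
    error = np.mean((X_train - X_reconstructed) ** 2)      # mean squared reconstruction error
    print(X_reconstructed.shape, error)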
31
votes
3 answers

Factor Loadings using sklearn

I want the correlations between individual variables and principal components in python. I am using PCA in sklearn. I don't understand how I can obtain the loading matrix after I have decomposed my data? My code is here. iris = load_iris() data, y…
Riyaz
  • 1,430
  • 2
  • 17
  • 27
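
One common convention (a sketch, assuming loadings are meant as correlations between standardized variables and components) scales the eigenvectors by the square roots of the eigenvalues:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()
    data = StandardScaler().fit_transform(iris.data)   # standardize so loadings read as correlations

    pca = PCA(n_components=2).fit(data)

    # loadings: eigenvectors scaled by sqrt(eigenvalues), shape (n_features, n_components)
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    print(loadings)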
28
votes
1 answer

R internal handling of sparse matrices

I have been comparing the performance of several PCA implementations from both Python and R, and noticed an interesting behavior: While it seems impossible to compute the PCA of a sparse matrix in Python (the only approach would be scikit-learn's…
dennlinger
  • 9,890
  • 1
  • 42
  • 63
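
On the Python side, the closest readily available option for sparse input is sklearn.decomposition.TruncatedSVD, which accepts scipy.sparse matrices directly; note that it does not centre the data, so it is a truncated SVD rather than exact PCA. A sketch on a toy sparse matrix:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    X = sp.random(10000, 300, density=0.01, format='csr', random_state=0)  # toy sparse matrix

    svd = TruncatedSVD(n_components=10, random_state=0)
    X_reduced = svd.fit_transform(X)             # works without densifying X

    print(X_reduced.shape)                       # (10000, 10)
    print(svd.explained_variance_ratio_.sum())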