5

I am currently running an exploratory factor analysis in Python, which works well with the factor_analyzer package (https://factor-analyzer.readthedocs.io/en/latest/factor_analyzer.html). To choose the appropriate number of factors, I used the Kaiser criterion and the Scree plot. However, I would like to confirm my results using Horn's parallel analysis (Horn, 1965). In R I would use the parallel function from the psych package. Does anyone know an equivalent method / function / package in Python? I've been searching for some time now, but unfortunately without success.

Thanks a lot for your help!

Best regards

rdas
  • 20,604
  • 6
  • 33
  • 46
MGr
  • 51
  • 3

2 Answers2

2

You've probably figured out a solution by now but, for the sake of others who might be looking for it, here's some code that I've used to mimic the parallel analysis from the psych library:

import pandas as pd
from factor_analyzer import FactorAnalyzer
import numpy as np
import matplotlib.pyplot as plt

def _HornParallelAnalysis(data, K=10, printEigenvalues=False):
    ################
    # Create a random matrix to match the dataset
    ################
    n, m = data.shape
    # Set the factor analysis parameters
    fa = FactorAnalyzer(n_factors=1, method='minres', rotation=None, use_smc=True)
    # Create arrays to store the values
    sumComponentEigens = np.empty(m)
    sumFactorEigens = np.empty(m)
    # Run the fit 'K' times over a random matrix
    for runNum in range(0, K):
        fa.fit(np.random.normal(size=(n, m)))
        sumComponentEigens = sumComponentEigens + fa.get_eigenvalues()[0]
        sumFactorEigens = sumFactorEigens + fa.get_eigenvalues()[1]
    # Average over the number of runs
    avgComponentEigens = sumComponentEigens / K
    avgFactorEigens = sumFactorEigens / K

    ################
    # Get the eigenvalues for the fit on supplied data
    ################
    fa.fit(data)
    dataEv = fa.get_eigenvalues()
    # Set up a scree plot
    plt.figure(figsize=(8, 6))

    ################
    ### Print results
    ################
    if printEigenvalues:
        print('Principal component eigenvalues for random matrix:\n', avgComponentEigens)
        print('Factor eigenvalues for random matrix:\n', avgFactorEigens)
        print('Principal component eigenvalues for data:\n', dataEv[0])
        print('Factor eigenvalues for data:\n', dataEv[1])
    # Find the suggested stopping points
    suggestedFactors = sum((dataEv[1] - avgFactorEigens) > 0)
    suggestedComponents = sum((dataEv[0] - avgComponentEigens) > 0)
    print('Parallel analysis suggests that the number of factors = ', suggestedFactors , ' and the number of components = ', suggestedComponents)


    ################
    ### Plot the eigenvalues against the number of variables
    ################
    # Line for eigenvalue 1
    plt.plot([0, m+1], [1, 1], 'k--', alpha=0.3)
    # For the random data - Components
    plt.plot(range(1, m+1), avgComponentEigens, 'b', label='PC - random', alpha=0.4)
    # For the Data - Components
    plt.scatter(range(1, m+1), dataEv[0], c='b', marker='o')
    plt.plot(range(1, m+1), dataEv[0], 'b', label='PC - data')
    # For the random data - Factors
    plt.plot(range(1, m+1), avgFactorEigens, 'g', label='FA - random', alpha=0.4)
    # For the Data - Factors
    plt.scatter(range(1, m+1), dataEv[1], c='g', marker='o')
    plt.plot(range(1, m+1), dataEv[1], 'g', label='FA - data')
    plt.title('Parallel Analysis Scree Plots', {'fontsize': 20})
    plt.xlabel('Factors/Components', {'fontsize': 15})
    plt.xticks(ticks=range(1, m+1), labels=range(1, m+1))
    plt.ylabel('Eigenvalue', {'fontsize': 15})
    plt.legend()
    plt.show();

If you call the above like this:

_HornParallelAnalysis(myDataSet)

You should get something like the following:

Example output for parallel analysis: enter image description here

Reza Rahemtola
  • 1,182
  • 7
  • 16
  • 30
2

Thanks for sharing Eric and Reza. Here I also provide a faster solution for those readers who do a PCA parallel analysis only. The above code is taking too long for me (apparently because of my very large dataset of size 33 x 15498) with no answer (I waited 1 day running it), so if anyone have only a PCA parallel analysis like my case, you can use this simple and very fast code, just you need to put your dataset in a csv file, this program reads in the csv and very fastly provides you with a PCA parallel analysis plot:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

shapeMatrix = pd.read_csv("E:\\projects\\ankle_imp_ssm\\results\\parallel_analysis\\data\\shapeMatrix.csv")
shapeMatrix.dropna(axis=1, inplace=True)
normalized_shapeMatrix=(shapeMatrix-shapeMatrix.mean())/shapeMatrix.std()

pca = PCA(shapeMatrix.shape[0]-1)
pca.fit(normalized_shapeMatrix)
transformedShapeMatrix = pca.transform(normalized_shapeMatrix)
#np.savetxt("pca_data.csv", pca.explained_variance_, delimiter=",")

random_eigenvalues = np.zeros(shapeMatrix.shape[0]-1)
for i in range(100):
    random_shapeMatrix = pd.DataFrame(np.random.normal(0, 1, [shapeMatrix.shape[0], shapeMatrix.shape[1]]))
    pca_random = PCA(shapeMatrix.shape[0]-1)
    pca_random.fit(random_shapeMatrix)
    transformedRandomShapeMatrix = pca_random.transform(random_shapeMatrix)
    random_eigenvalues = random_eigenvalues+pca_random.explained_variance_ratio_
random_eigenvalues = random_eigenvalues / 100


#np.savetxt("pca_random.csv", random_eigenvalues, delimiter=",")

plt.plot(pca.explained_variance_ratio_, '--bo', label='pca-data')
plt.plot(random_eigenvalues, '--rx', label='pca-random')
plt.legend()
plt.title('parallel analysis plot')
plt.show()

Byy running this piece of code on the matrix of shapes for which I created a statistical shape model. (Shape matrix is of size: 33 x 15498) and it takes just a few seconds to run.

enter image description here

lost_phd
  • 323
  • 2
  • 6