
I'm a bit puzzled: I calculated a PCA of the same dataset in two different tools and the results do not match. Here's the workflow:

  • Orange 3.26: read a .csv, PCA with 4 PCs (normalized variables), scatterplot
  • scikit-learn: read the same .csv, standardize the numerical values with StandardScaler(with_mean=True, with_std=True), then PCA(copy=True, iterated_power='auto', n_components=4, random_state=None, svd_solver='auto', tol=0.0, whiten=False)

The numerical values of the individual PCs differ between the two tools:

Orange 3.26: [screenshot of the PCA output, not reproduced]

scikit-learn: [screenshot of the PCA output, not reproduced]

Here is my scikit-learn code:

I have a pd.DataFrame of shape (268, 16). In a first step I slice the dataframe into two dataframes:

  • A1: all rows and all features; shape (268, 13)
  • B1: the targets and the ID of every row; shape (268, 3)

In a next step I standardize dataframe A1 with StandardScaler from sklearn.preprocessing:

from sklearn.preprocessing import StandardScaler
a1 = StandardScaler(with_mean=True, with_std=True).fit_transform(A1)

The next step is the PCA:

from sklearn.decomposition import PCA
pca1 = PCA(n_components=4)
principalComponents1 = pca1.fit_transform(a1)

The outputs are the scores and loadings - nothing special.
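
For reference, a minimal, self-contained sketch of the steps above. The random array is only an assumed stand-in for A1 (the real data is not shown here), so the values will not match the screenshots; the names `scores` and `loadings` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: random stand-in for the feature frame A1 (268 rows, 13 features).
rng = np.random.default_rng(0)
A1 = pd.DataFrame(rng.normal(size=(268, 13)))

# Standardize each column to mean 0 and (population) standard deviation 1.
a1 = StandardScaler(with_mean=True, with_std=True).fit_transform(A1)

# Project onto the first 4 principal components.
pca1 = PCA(n_components=4)
scores = pca1.fit_transform(a1)   # one row per sample, one column per PC
loadings = pca1.components_       # component vectors, one row per PC

print(scores.shape, loadings.shape)
print(pca1.explained_variance_ratio_)
```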

Perhaps a difference in normalization of the initial dataset? Any suggestions?

Markus
  • this can be due to a number of factors such as: the normalization, the PCA solver (eig, svd), the input argument `random_state=None` for the PCA in `sklearn` .... – seralouk Sep 08 '20 at 07:24
  • Thanks @seralouk. However, my impression is that the normalization should be the reason. Unfortunately I can't find the "default" settings for normalization in Orange3. – Markus Sep 08 '20 at 07:36
  • Do it manually; standardization is easy to do. Here are the equations: https://stackoverflow.com/a/50879522/5025009 – seralouk Sep 08 '20 at 08:52
  • @seralouk: I did exactly the same in scikit-learn. I just want to understand the difference in Orange3 when you take the initial data, go to the PCA-widget, choose 4 PCs and enable the normalization-check box. When I take the standardized data (with µ=0 and s²=1) in Orange3 and perform a PCA without normalization, the PCs are again different (somewhere in between the screenshots in my question). – Markus Sep 08 '20 at 09:04
  • Can you post the full code? – seralouk Sep 08 '20 at 10:15
  • I try... I will edit my initial question. – Markus Sep 08 '20 at 10:33
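
The manual standardization suggested in the comments can be sketched as follows. StandardScaler divides by the population standard deviation (ddof=0). Which convention Orange uses is not established in this thread, so the ddof=1 variant below is only a hypothesis to test, and the data is a random stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: random stand-in with the same shape as the feature frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(268, 13))

# Manual z-scores with the population (ddof=0) and sample (ddof=1) std.
z0 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
z1 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The ddof=0 variant reproduces StandardScaler.
same_as_sklearn = np.allclose(z0, StandardScaler().fit_transform(X))

# PCA scores for both variants.
s0 = PCA(n_components=4).fit_transform(z0)
s1 = PCA(n_components=4).fit_transform(z1)

# z1 equals z0 times sqrt((n - 1) / n), so the scores scale by the same factor.
# Absolute values are compared because the sign of a PC is arbitrary.
n = X.shape[0]
factor = np.sqrt((n - 1) / n)
scores_scaled = np.allclose(np.abs(s1), np.abs(s0) * factor)

print(same_as_sklearn, scores_scaled, factor)
```

With 268 rows the factor is about 0.998, so the choice of ddof alone changes the scores by only about 0.2 %. If the two tools differ by more than that, or only in sign, the cause is likely something other than the ddof convention.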
