
I'm using the Prince package to perform a FAMD on mixed data (both categorical and numerical columns).

My code is the following:

import prince

famd = prince.FAMD(n_components=10, n_iter=3, copy=True, check_input=True, engine='auto', random_state=42)
famd = famd.fit(df_pca)

This gives the following output:

Explained inertia
[0.08057161 0.05946225 0.03875787 0.03203083 0.02978785 0.02868602
 0.02499968 0.02416245 0.02207422 0.02055546]
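
(For reference, the values above are what I get by printing the explained inertia attribute; explained_inertia_ is the name in the Prince version I'm using, which may differ in newer releases:)

# my Prince version exposes the explained inertia per component as an attribute
print('Explained inertia')
print(famd.explained_inertia_)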

I have already tried df = pd.DataFrame(pca.components_, columns=list(dfPca.columns)), as mentioned in PCA on sklearn - how to interpret pca.components_ . I have also attempted to implement the solution offered by user seralouk there, with some minor changes to make it fit the Prince FAMD:

inertia = famd.explained_inertia_
n_pcs = len(inertia)
most_important = [inertia[i].argmax() for i in range(n_pcs)]
initial_feature_names = df_pca.columns
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())

However, this does not appear to work for the Prince FAMD. Is there a way to link the output of the FAMD back to the original variable names?

Marie M

1 Answer


The link you cited is for a PCA in sklearn. You are now using a FAMD from a different package, which is quite a different thing altogether.

In the linked thread, the solution by @seralouk goes through the eigenvector of each PC and picks out the column with the highest absolute weight. Note that this is NOT linking each component to an original column; it is finding the original column that contributes most to each PC.
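
For context, here is a minimal sketch of what that sklearn answer does (assuming a fitted sklearn PCA called pca and the original DataFrame df; pca.components_ has shape (n_components, n_features)):

import numpy as np

# row i of pca.components_ is the eigenvector (loadings) of PC i;
# argmax over the absolute weights picks the single most influential column
top_idx = np.abs(pca.components_).argmax(axis=1)
top_features = df.columns[top_idx]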

You can do something like the following, but I would suggest reading up on FAMD / PCA in this book to be sure of what you are actually extracting:

Below is a rough implementation that gets the column contributing most to each component, using the V_ matrix, based on the example from the Prince help page:

import pandas as pd
import numpy as np
import prince

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
        ['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
        ['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
        ['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
        ['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
        ['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5],
    ],
    columns=['E1 fruity', 'E1 woody', 'E1 coffee',
             'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
             'E3 fruity', 'E3 butter', 'E3 woody'],
    index=['Wine {}'.format(i+1) for i in range(6)],
)

n_pcs = 3
famd = prince.FAMD(n_components=n_pcs)
famd = famd.fit(X)

# each row of famd.V_ holds the loadings of one component; the column with
# the largest absolute weight is the one that contributes most to that PC
most_important = np.abs(famd.V_).argmax(axis=1)
initial_feature_names = X.columns
most_important_names = initial_feature_names[most_important]
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())
print(pca_results)

     0             1
0  PC1     E1 coffee
1  PC2      E1 woody
2  PC3  E2 red fruit
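
If you want more than the single top column per component, you can (under the same assumption that famd.V_ holds one row of loadings per component) wrap it in a labelled DataFrame and inspect all the weights yourself:

# rows = components, columns = original variables; this is the same layout
# that the argmax above relies on
loadings = pd.DataFrame(
    famd.V_,
    index=['PC{}'.format(i+1) for i in range(n_pcs)],
    columns=X.columns,
)
print(loadings.round(3))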
StupidWolf