
I am trying to use the mca package to do multiple correspondence analysis in Python.

I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find the principal components for those data) and then later use those principal components to transform unseen data.

Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. `.E`, `.L`, `.K`, `.k`).

So far, if I have a DataFrame with a single column containing strings, I would do something like:

import pandas as pd
import mca
ca = mca.MCA(pd.get_dummies(df, drop_first=True))

From what I can gather,

ca.fs_r(1)

is the transformation of the data in df, and

ca.L

is supposed to be the eigenvalues (although I get a vector of 1s that is one element shorter than my number of features?).

Now, if I had some more data with the same features, say df_new, and assuming I've already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?

Dan
  • Judging from the [mca usage guide](https://github.com/esafak/mca/blob/master/docs/usage.rst), you need to use `ca.fs_r_sup(df_new)` to project your new data. Does this help? – Jan Trienes Feb 03 '18 at 15:01
  • But how do you know from that document? I guess you could infer it because `mca_counts.fs_r_sup(new_counts, 2)` takes a variable called `new_counts`, but does it actually document what each function and property is supposed to do somewhere? – Dan Feb 04 '18 at 00:30
  • @JanTrienes you might as well add that as an answer to claim the bounty – Dan Feb 08 '18 at 10:43
  • Done. Nevertheless, I feel that my answer can be improved with appropriate background info on MCA. Also, I agree that the package is not very well documented, which makes things harder. – Jan Trienes Feb 08 '18 at 11:08
  • @JanTrienes Yup, and the single letter variable and function names do not help much either – Dan Feb 08 '18 at 14:37
  • You can try looking at [Prince](https://github.com/MaxHalford/prince), it is very well documented and easy to use. – Axois Jul 27 '19 at 06:15
  • @Axois thanks! This looks great :) If you know the package, could you post an answer here demonstrating how to use it to fit and transform categorical data? – Dan Jul 27 '19 at 09:39

2 Answers


Another option is the prince library, which provides easy-to-use implementations of methods such as:

  1. Multiple correspondence analysis (MCA)
  2. Principal component analysis (PCA)
  3. Multiple factor analysis (MFA)

Install it with:

pip install --user prince

Using MCA is straightforward and takes only a couple of steps, much like PCA in scikit-learn. First, build the DataFrame.

import pandas as pd 
import prince

X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']

print(X.head())

mca = prince.MCA()

# outputs
>>     Color   Size   Action    Age Inflated
   0  YELLOW  SMALL  STRETCH  ADULT        T
   1  YELLOW  SMALL  STRETCH  CHILD        F
   2  YELLOW  SMALL      DIP  ADULT        F
   3  YELLOW  SMALL      DIP  CHILD        F
   4  YELLOW  LARGE  STRETCH  ADULT        T

Then call the fit and transform methods.

mca = mca.fit(X)  # learn the components from X
coords = mca.transform(X)  # row coordinates of X; pass unseen data here instead, like ca.fs_r_sup(df_new)
print(coords)

# outputs
>>         0             1
0   0.705387  8.373126e-15
1  -0.386586  8.336230e-15
2  -0.386586  6.335675e-15
3  -0.852014  6.726393e-15
4   0.783539 -6.333333e-01
5   0.783539 -6.333333e-01
6  -0.308434 -6.333333e-01
7  -0.308434 -6.333333e-01
8  -0.773862 -6.333333e-01
9   0.783539  6.333333e-01
10  0.783539  6.333333e-01
11 -0.308434  6.333333e-01
12 -0.308434  6.333333e-01
13 -0.773862  6.333333e-01
14  0.861691 -5.893240e-15
15  0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15

You can also plot the result, since prince integrates with matplotlib. (As a comment below notes, plot_coordinates() is deprecated in newer versions of prince in favour of plot().)

ax = mca.plot_coordinates(
     X=X,
     ax=None,
     figsize=(6, 6),
     show_row_points=True,
     row_points_size=10,
     show_row_labels=False,
     show_column_points=True,
     column_points_size=30,
     show_column_labels=False,
     legend_n_cols=1
     )

ax.get_figure().savefig('images/mca_coordinates.svg')

[Image: MCA coordinates plot]

Axois
  • The above code gives an error. You need to remove the assignment at the mca.transform(X) call. This function returns a DataFrame, which shouldn't be assigned to mca again. – Pallavi Jun 17 '21 at 14:14
  • Hello looks like the function plot_coordinates() is deprecated and need to use plot() instead. Any ideas on how to do labelling of the plot using plot() function? – Kenneth Singh Jun 14 '23 at 16:17

The documentation of the mca package is not very clear in that regard. However, there are a few cues suggesting that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.

  1. The package author refers to new data as supplementary data, which is the terminology used in the following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of measurement and statistics, 651-657.
  2. The package has only two functions that accept new data as the parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter finds the column factor scores.
  3. The usage guide demonstrates this with a new data frame that was not used in the component analysis.
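
To make the idea of "supplementary" rows more concrete, here is a minimal NumPy sketch of plain correspondence analysis on an indicator (one-hot) matrix, followed by the projection of a new row onto the fitted axes. This is not the mca package's implementation: the function names `fit_ca` and `project_rows` and the toy data are made up for illustration, and no correction of the eigenvalues is applied, so the numbers should not be expected to match the package's output.

```python
import numpy as np

def fit_ca(Z, n_components=2):
    """Plain correspondence analysis of an indicator matrix Z (illustrative helper)."""
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses (every category must occur)
    # Standardized residuals from the independence model r c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    U, sing, Vt = U[:, :n_components], sing[:n_components], Vt[:n_components]
    row_coords = (U * sing) / np.sqrt(r)[:, None]  # principal row coordinates
    col_std = Vt.T / np.sqrt(c)[:, None]           # standard column coordinates
    return row_coords, col_std, sing ** 2

def project_rows(Z_new, col_std):
    """Project supplementary rows: row profiles times standard column coordinates."""
    Z_new = np.asarray(Z_new, dtype=float)
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)
    return profiles @ col_std

# Toy data: two categorical variables, one-hot encoded as [A, B, C | x, y]
Z = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
])
row_coords, col_std, eigvals = fit_ca(Z, n_components=2)

# An unseen observation (A, y), encoded with the same columns in the same order
Z_new = np.array([[1, 0, 0, 0, 1]])
new_coords = project_rows(Z_new, col_std)
print(new_coords)
```

Projecting the training rows with `project_rows` reproduces `row_coords`, which is a quick sanity check that the new rows land in the same space. Whatever library is used, the new data must have the same indicator columns, in the same order, as the data used for fitting; with pandas, one way to do this is `pd.get_dummies(df_new).reindex(columns=train_columns, fill_value=0)`, where `train_columns` is a hypothetical name for the dummy columns of the training frame.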
Jan Trienes
  • Although this directly answers the question and so is technically the correct solution, anyone reading this should take a look at the answer by [Axois](https://stackoverflow.com/a/57237247/1011724) for a better alternative to the library in question. – Dan Jul 29 '19 at 08:57