I want to calculate a chi-squared test statistic between pairs of columns in a pandas DataFrame. It seems like there must be a way to do this in a similar fashion to DataFrame.corr.
If I have the following data frame:
import pandas as pd

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])
I would hope to be able to do something like:
df.corr('chisquare')
Though this will obviously fail. If the dataframe were numeric rather than categorical, I could simply call df.corr() and pass either spearman or pearson. There must be a way of calculating chi-squared between all of the columns as well.
So the output (using scipy.stats.chi2_contingency) would be:
        ll      kk   jj
ll  0.0000  0.1875  0.0
kk  0.1875  0.0000  0.0
jj  0.0000  0.0000  0.0
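For reference, here is how a single entry of that matrix, the ll/kk value, can be obtained. This is only a sketch: it assumes scipy is installed and uses the default settings of chi2_contingency (which, as far as I understand, applies a continuity correction to 2x2 tables). The names table and ll_kk_stat are just for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# Observed counts for the ll/kk pair: a 2x2 contingency table
table = pd.crosstab(df['ll'], df['kk'])

# The first element of the result is the chi-squared statistic
ll_kk_stat = chi2_contingency(table)[0]
print(round(ll_kk_stat, 4))  # 0.1875
```

Repeating this for every pair of columns is what fills in the off-diagonal entries; the diagonal is simply set to 0.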
Am I just missing something, or is this not possible without coding each step of the process individually? I am looking for something like DataFrame.corr but with categorical data.
EDIT: To clear up any confusion about what I'm currently doing to get the resulting matrix:
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def get_corr_mat(df, f=chi2_contingency):
    # Pairwise chi-squared statistics; the diagonal is left at 0.
    columns = df.columns
    dm = pd.DataFrame(0.0, index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        # Contingency table of observed counts for this pair of columns
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        chi2_stat = f(cont_table)[0]
        dm.loc[var2, var1] = chi2_stat
        dm.loc[var1, var2] = chi2_stat
    return dm

get_corr_mat(df)
This does work, though it can get slow and is not tested. A pandas method would be much preferable.