
I want to calculate a chi-squared test statistic between pairs of columns in a pandas DataFrame. It seems like there must be a way to do this in a similar fashion to pandas.corr.

If I have the following DataFrame:

df = pd.DataFrame([['a', 'x', 'a'], 
                   ['b', 'z', 'a'], 
                   ['a', 'x', 'a']], 
                  columns=['ll', 'kk', 'jj'], 
                  index=['nn', 'oo', 'pp'])

I would hope to be able to do something like:

df.corr('chisquare')

Though this will obviously fail. If the DataFrame were numeric rather than categorical, I could simply do df.corr() and pass either 'spearman' or 'pearson'. There must be a way of calculating chi-squared between all of the columns as well.

So the output (using scipy.stats.chi2_contingency) would be:

    ll      kk      jj
ll  0.0000  0.1875  0.0
kk  0.1875  0.0000  0.0
jj  0.0000  0.0000  0.0
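To make the counting explicit, the 0.1875 entry for the ll/kk pair comes from running chi2_contingency on their crosstab (scipy applies the Yates continuity correction to 2x2 tables by default, which is where the fractional value comes from):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

# 2x2 contingency table of ll vs kk
table = pd.crosstab(df['ll'], df['kk'])
chi2 = chi2_contingency(table)[0]  # 0.1875 with the default Yates correction
```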

Am I just missing something, or is this not possible without coding each step of the process individually? I am looking for something like pd.corr but with categorical data.

EDIT: In order to clear up any confusion as to what I'm currently doing in order to get the resulting matrix:

from itertools import combinations
import pandas as pd
from scipy.stats import chi2_contingency

def get_corr_mat(df, f=chi2_contingency):
    columns = df.columns
    dm = pd.DataFrame(index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        # build a contingency table for the pair and record the test statistic
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        chi2_stat = f(cont_table)[0]
        dm.loc[var2, var1] = chi2_stat
        dm.loc[var1, var2] = chi2_stat
    dm.fillna(0, inplace=True)  # diagonal was never filled in
    return dm

get_corr_mat(df) 

As I've stated previously, this does work, though it can get slow and is not tested. A pandas method would be much preferable.
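One near-built-in route, assuming a pandas version (0.24+) where .corr accepts a callable, is to factorize the categorical columns to integer codes and pass a chi-squared callable. This is a sketch, not an official recipe; note that .corr forces the diagonal to 1.0 rather than the 0.0 shown above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([['a', 'x', 'a'],
                   ['b', 'z', 'a'],
                   ['a', 'x', 'a']],
                  columns=['ll', 'kk', 'jj'],
                  index=['nn', 'oo', 'pp'])

def chi2_stat(x, y):
    # .corr passes each pair of columns as 1-D arrays
    return chi2_contingency(pd.crosstab(x, y))[0]

# .corr only works on numeric data, so encode the categories first;
# the chi-squared statistic doesn't depend on which codes factorize picks
codes = df.apply(lambda s: pd.factorize(s)[0])
result = codes.corr(method=chi2_stat)  # diagonal is set to 1.0 by pandas
```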

johnchase
  • Could you explain in more detail how you determined what the output using `scipy.stats.chi2_contingency` would be? How are you counting the elements of your DataFrame? – Warren Weckesser Nov 18 '16 at 01:17
  • It would be a fair amount of code, however the basic idea is that I am creating a contingency table from pairs of columns (vectors) and then passing that to the `scipy.stats.chi2_contingency` function. There are likely many ways to achieve this, however it is surprising to me that there is a method that will do this for numeric but not categorical data. It doesn't have to be a chi-square test either; I could see situations where another test would be desired – johnchase Nov 18 '16 at 17:11
  • *"I am creating a contingency table from pairs of columns (vectors)"* Sorry if I'm being slow, but this is still unclear. How are you creating a 2-d contingency table from three columns? (I can see how one could make a 3-d contingency table: count how many times each unique row occurs in the data, and then use the elements in each row as if they were named indices of the 3-d table, and put the count at that location.) – Warren Weckesser Nov 18 '16 at 20:56
  • That is essentially what I'm doing, for *each* unique pair of columns a contingency table is created and a test statistic is calculated and recorded and then put into the final matrix. This is exactly the way that `pd.corr` works, except I am using a different test and therefore reporting a different statistic – johnchase Nov 18 '16 at 21:33
  • Ah, I see. I don't know of any existing code to build that table for you. – Warren Weckesser Nov 19 '16 at 00:48

1 Answer


Alternate Method 1

Another way to find the chi-squared test statistic between pairs of columns, along with a heatmap visualisation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

def ch_calculate(df):
    factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

    chi2, p_values = [], []

    for f in factors_paired:
        if f[0] != f[1]:
            chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
            chi2.append(chitest[0])
            p_values.append(chitest[1])
        else:  # a factor paired with itself
            chi2.append(0)
            p_values.append(0)

    chi2 = np.array(chi2).reshape((len(df.columns), len(df.columns)))  # shape it as a matrix
    chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values)  # then a df for convenience
    fig, ax = plt.subplots(figsize=(30, 30))
    sns.heatmap(chi2, annot=True)
    plt.show()

ch_calculate(df_categorical)

Where df_categorical is a DataFrame containing all the nominal input variables of a dataset. For ordinal categorical variables I think it is better to use .corr(method='spearman') (the Spearman rank correlation coefficient).
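For example (the size/grade columns here are made up for illustration), encode each ordinal column as an ordered Categorical and run Spearman on the integer codes:

```python
import pandas as pd

# hypothetical ordinal data
df_ord = pd.DataFrame({
    'size':  ['small', 'large', 'medium', 'small'],
    'grade': ['low', 'high', 'medium', 'low'],
})
orders = {'size': ['small', 'medium', 'large'],
          'grade': ['low', 'medium', 'high']}

# map each column onto integer codes that respect its ordering
codes = df_ord.apply(
    lambda s: pd.Categorical(s, categories=orders[s.name], ordered=True).codes)
rho = codes.corr(method='spearman')
```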

Alternate Method 2 with Cramér's V

I also came across this Cramér's V implementation for finding the degree of association between categorical variables: Categorical features correlation. Using it, I created another function that draws a heatmap for finding correlated categorical columns (with Cramér's V the heatmap values range from 0 to 1, where 0 means no association and 1 means strong association).

from itertools import combinations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as ss
import seaborn as sns

def get_corr_mat(df):
    columns = df.columns
    dm = pd.DataFrame(index=columns, columns=columns)
    for var1, var2 in combinations(columns, 2):
        cont_table = pd.crosstab(df[var1], df[var2], margins=False)
        chi2_stat = cramers_v(cont_table.values)
        dm.loc[var2, var1] = chi2_stat
        dm.loc[var1, var2] = chi2_stat
    dm.fillna(1, inplace=True)  # a variable is perfectly associated with itself
    return dm

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

cat_corr = get_corr_mat(df_categorical)
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(cat_corr, annot=True)
plt.show()