A vanilla Python implementation is available here Categorical features correlation
What's the best way to implement the same in PySpark?
I went about it the following way:
import numpy as np
import scipy.stats as ss

def cramers_v(df, feature1, feature2):
    # Build the contingency matrix with Spark's built-in crosstab, then collect
    # it to pandas; crosstab names its label column "<feature1>_<feature2>".
    contingency_matrix = df.crosstab(feature1, feature2)
    contingency_matrix = contingency_matrix.toPandas().drop(feature1 + '_' + feature2, axis=1)
    # Bias-corrected Cramer's V, as in the code linked in the question.
    chi2 = ss.chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = contingency_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
Just create the contingency matrix using Spark's built-in crosstab function. These matrices are usually small enough to fit in memory, so convert the result to a pandas DataFrame and then simply reuse the code linked in the question.
It's not a lot of change, but it may be useful for people who have very large Spark DataFrames that do not fit in memory.
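For illustration, a minimal usage sketch might look like the following (the toy data and the column names "color" and "size" are hypothetical, and cramers_v is the function defined above):

from pyspark.sql import SparkSession

# Toy example data; "color" and "size" are illustrative column names.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("red", "S"), ("red", "M"), ("blue", "M"), ("blue", "L"), ("green", "S")],
    ["color", "size"],
)

# Prints the bias-corrected Cramer's V between the two categorical columns.
print(cramers_v(sdf, "color", "size"))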