A vanilla Python implementation is available here Categorical features correlation
What's the best way to implement the same in PySpark?
I went about it the following way:
import numpy as np
import scipy.stats as ss

def cramers_v(df, feature1, feature2):
    # Build the contingency matrix with Spark's built-in crosstab, then collect
    # it to pandas; crosstab names its label column "<feature1>_<feature2>".
    contingency_matrix = df.crosstab(feature1, feature2)
    contingency_matrix = contingency_matrix.toPandas().drop(feature1 + '_' + feature2, axis=1)
    # Bias-corrected Cramer's V, as in the code linked in the question.
    chi2 = ss.chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = contingency_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
Just create the contingency matrix using Spark's built-in crosstab function. These matrices are usually small enough to fit in memory, so convert the result to a pandas DataFrame and then simply reuse the code linked in the question.
It's not a lot of change, but it may be useful for people who have very large Spark DataFrames that do not fit in memory.
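For illustration, a minimal usage sketch might look like the following (the toy data and the column names "color" and "size" are hypothetical, and cramers_v is the function defined above):

from pyspark.sql import SparkSession

# Toy example data; "color" and "size" are illustrative column names.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("red", "S"), ("red", "M"), ("blue", "M"), ("blue", "L"), ("green", "S")],
    ["color", "size"],
)

# Prints the bias-corrected Cramer's V between the two categorical columns.
print(cramers_v(sdf, "color", "size"))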