I'm trying to construct a co-occurrence matrix from my DataFrame on Databricks using the pyspark.pandas API.

I tried the method described in "Constructing a co-occurrence matrix in python pandas".

The code works fine in pandas, but throws an error with pyspark.pandas:

coocc = psdf.T.dot(psdf)
coocc

I'm getting this error:

TypeError: Unsupported type DataFrame

I checked the docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.dot.html

According to them, pyspark.pandas.DataFrame.dot() takes a Series as input, not a DataFrame.

I tried converting the DataFrame to a Series using psdf.squeeze(), but it does not return a Series because my DataFrame has multiple columns.

Is there any way to convert a pyspark.pandas.DataFrame to a pyspark.pandas.Series? Or is there a different method in pyspark.pandas to construct a co-occurrence matrix?

1 Answer

I solved it using scipy's csr_matrix, since the DataFrame only contains 1 and 0 as values:

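import pyspark.pandas as ps
import scipy.sparse as sp

# Note: .values collects the whole DataFrame to the driver as a NumPy array.
psdfx = sp.csr_matrix(psdf.astype(int).values)
psdfc = psdfx.T * psdfx  # sparse matrix product: column-by-column co-occurrence counts
psdfc.setdiag(0)  # zero out each column's co-occurrence with itself
coocc = ps.DataFrame(psdfc.toarray(), columns=psdf.columns, index=psdf.columns)
coocc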

Ref: https://stackoverflow.com/a/37840528/19642283
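
To make it easier to see what the product computes, here is a minimal sketch on a toy 0/1 matrix using only NumPy and SciPy, so it runs without a Spark session. The helper name `cooccurrence` and the toy data are made up for illustration, and the diagonal is cleared on the dense result with `np.fill_diagonal` rather than with `setdiag`, which is a small variation on the approach above.

```python
import numpy as np
import scipy.sparse as sp


def cooccurrence(binary_rows):
    """Return column-by-column co-occurrence counts for a 0/1 matrix.

    Hypothetical helper for illustration only.
    """
    x = sp.csr_matrix(np.asarray(binary_rows, dtype=int))
    counts = (x.T @ x).toarray()   # entry (i, j) = rows where columns i and j are both 1
    np.fill_diagonal(counts, 0)    # ignore each column's co-occurrence with itself
    return counts


# Toy data (assumed): 4 rows by 3 columns, e.g. items a, b, c.
toy = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
result = cooccurrence(toy)
print(result)
```

Entry (i, j) counts the rows in which columns i and j are both 1. The result has one row and one column per original column, so it is usually small enough to wrap back into a ps.DataFrame even when the input has many rows. The input itself still has to fit on the driver for this approach.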