
I am new to PySpark.
I want to compute the correlation between a column (int) and another column (a vector produced by OneHotEncoder).
I use this code:

import six

# For every column that does not hold strings, print its correlation with 'label'
for i in df.columns:
    if not isinstance(df.select(i).take(1)[0][0], six.string_types):
        print("Correlation to label for", i, df.stat.corr('label', i))

I get this error when it computes the correlation between the label and a OneHotEncoder column:

Py4JJavaError: An error occurred while calling o9219.corr. :
  java.lang.IllegalArgumentException:
    requirement failed:
      Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 not supported
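For context, df.stat.corr only accepts numeric columns, so the loop above fails as soon as it hits the OneHotEncoder output. A minimal sketch (not part of the original post) that skips non-numeric columns by inspecting df.dtypes instead of sampling a value:

# Sketch: only call corr on columns whose dtype is numeric;
# VectorUDT columns show up as 'vector' in df.dtypes and are skipped.
numeric_prefixes = ("int", "bigint", "smallint", "tinyint", "float", "double", "decimal")

for name, dtype in df.dtypes:
    if dtype.startswith(numeric_prefixes):
        print("Correlation to label for", name, df.stat.corr("label", name))
    else:
        print("Skipping", name, "with unsupported dtype", dtype)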
  • What error do you get? – Serdalis Sep 07 '18 at 09:53
  • Sorry I forgot to post the error. Here is the error: Py4JJavaError: An error occurred while calling o9219.corr. : java.lang.IllegalArgumentException: requirement failed: Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 not supported. – Gregorius Edwadr Sep 07 '18 at 09:54
  • 1
    Added it to the question for you :) – Serdalis Sep 07 '18 at 09:57
  • Do you have a minimal example to replicate? I guess you have to do an appropriate type cast in order to make the correlation work. – pansen Sep 07 '18 at 11:07
  • @pansen the column I am trying to correlate is the output of a OneHotEncoder, so it is a vector. What type should I convert it to? – Gregorius Edwadr Sep 10 '18 at 02:39
  • Mmh, seems like you need to convert your vector column into regular numeric columns. See [here](https://stackoverflow.com/questions/38384347/how-to-split-vector-into-columns-using-pyspark?rq=1). – pansen Sep 10 '18 at 10:57
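Following the last comment's suggestion, a minimal sketch of splitting the one-hot vector into plain numeric columns before calling corr (assuming Spark 3.x, where pyspark.ml.functions.vector_to_array is available; the column name category_ohe is hypothetical and stands in for the OneHotEncoder output):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Turn the VectorUDT column into an array column
df2 = df.withColumn("ohe_arr", vector_to_array("category_ohe"))

# Expand each slot of the one-hot vector into its own double column,
# which df.stat.corr can handle
n_slots = len(df2.select("ohe_arr").first()[0])
for k in range(n_slots):
    df2 = df2.withColumn("ohe_%d" % k, F.col("ohe_arr")[k])
    print("Correlation to label for ohe_%d:" % k,
          df2.stat.corr("label", "ohe_%d" % k))

On Spark versions before 3.0 the same idea works with a small UDF that converts the vector to an array, as described in the linked answer.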

0 Answers