
I am new to PySpark.
I want to compute the correlation between a column (int) and another column (a vector produced by OneHotEncoder).
I use this code:

import six

# For every column that does not hold strings, print its correlation with 'label'
for i in df.columns:
    if not isinstance(df.select(i).take(1)[0][0], six.string_types):
        print("Correlation to label for", i, df.stat.corr('label', i))

I get this error when it computes the correlation between the label and a OneHotEncoder column:

Py4JJavaError: An error occurred while calling o9219.corr. :
  java.lang.IllegalArgumentException:
    requirement failed:
      Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 not supported
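For context, df.stat.corr only accepts numeric columns, so the loop above fails as soon as it hits the OneHotEncoder output. A minimal sketch (not part of the original post) that skips non-numeric columns by inspecting df.dtypes instead of sampling a value:

# Sketch: only call corr on columns whose dtype is numeric;
# VectorUDT columns show up as 'vector' in df.dtypes and are skipped.
numeric_prefixes = ("int", "bigint", "smallint", "tinyint", "float", "double", "decimal")

for name, dtype in df.dtypes:
    if dtype.startswith(numeric_prefixes):
        print("Correlation to label for", name, df.stat.corr("label", name))
    else:
        print("Skipping", name, "with unsupported dtype", dtype)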
  • What error do you get? – Serdalis Sep 07 '18 at 09:53
  • Sorry I forgot to post the error. Here is the error: Py4JJavaError: An error occurred while calling o9219.corr. : java.lang.IllegalArgumentException: requirement failed: Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 not supported. – Gregorius Edwadr Sep 07 '18 at 09:54
  • 1
    Added it to the question for you :) – Serdalis Sep 07 '18 at 09:57
  • Do you have a minimal example to replicate? I guess you have to do an appropriate type cast in order to make the correlation work. – pansen Sep 07 '18 at 11:07
  • @pansen the column I am trying to correlate is the output of a OneHotEncoder, so it is a vector. What type should I convert it to? – Gregorius Edwadr Sep 10 '18 at 02:39
  • Mmh, seems like you need to convert your vector column into regular numeric columns. See [here](https://stackoverflow.com/questions/38384347/how-to-split-vector-into-columns-using-pyspark?rq=1). – pansen Sep 10 '18 at 10:57
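Following the last comment's suggestion, a minimal sketch of splitting the one-hot vector into plain numeric columns before calling corr (assuming Spark 3.x, where pyspark.ml.functions.vector_to_array is available; the column name category_ohe is hypothetical and stands in for the OneHotEncoder output):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Turn the VectorUDT column into an array column
df2 = df.withColumn("ohe_arr", vector_to_array("category_ohe"))

# Expand each slot of the one-hot vector into its own double column,
# which df.stat.corr can handle
n_slots = len(df2.select("ohe_arr").first()[0])
for k in range(n_slots):
    df2 = df2.withColumn("ohe_%d" % k, F.col("ohe_arr")[k])
    print("Correlation to label for ohe_%d:" % k,
          df2.stat.corr("label", "ohe_%d" % k))

On Spark versions before 3.0 the same idea works with a small UDF that converts the vector to an array, as described in the linked answer.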

0 Answers