
I use PySpark.

Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.

I've tried the following:

output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'.

Any suggestions on how to transform a column of vectors into columns of its values?

Jacek Laskowski
Petrichor

4 Answers


I figured out the problem with the suggestion above. In PySpark, dense vectors are "simply represented as NumPy array objects", so the issue is with Python and NumPy types: a UDF declared with FloatType must return a Python float, not a numpy.float64. Adding .item() casts the numpy.float64 to a Python float.
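The type mismatch can be seen in plain NumPy, with no Spark involved (a minimal sketch):

```python
import numpy as np

# Indexing a NumPy array yields a numpy.float64, not a Python float.
v = np.array([0.25, 0.75])
print(type(v[0]))         # <class 'numpy.float64'>

# .item() converts it to the native Python type that FloatType expects.
print(type(v[0].item()))  # <class 'float'>
```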

The following code works:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))

Or to append these columns to the original dataframe:

randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Petrichor

Got the same problem; below is the code adjusted for an n-length vector. Note that i must be bound as a default argument (i=i), otherwise every UDF captures the loop variable by reference and ends up using its final value.

splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias('Column' + str(i)) for i, s in enumerate(splits)])
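The i=i binding matters because Python closures look up the loop variable late, when the lambda is called, not when it is defined. A plain-Python sketch of the pitfall:

```python
# Without a default argument, every lambda sees the loop's final value of i.
broken = [lambda v: v[i] for i in range(3)]
print([f([10, 20, 30]) for f in broken])  # [30, 30, 30]

# Binding i=i freezes the current index into each lambda.
fixed = [lambda v, i=i: v[i] for i in range(3)]
print([f([10, 20, 30]) for f in fixed])   # [10, 20, 30]
```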
Rookie Boy

You may want to use one UDF to extract the first value and another to extract the second. You can then apply the UDFs in a select call on the random forest output DataFrame. Example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
                                     split2_udf(col("probability")).alias("c2"))

This should give you a DataFrame output2 with columns c1 and c2 corresponding to the first and second values in the vector stored in the probability column.

Alberto Bonsanto
Saif Charaniya
    I tried your suggestion, but it produces an error, similar to the one mentioned here: http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class – Petrichor May 19 '16 at 17:12

I tried @Rookie Boy's loop, but the splits UDF loop didn't work for me, so I modified it a bit.

out = df
for i in range(n):
    # Bind i=i so each UDF keeps its own index despite Spark's lazy evaluation.
    splits_i = udf(lambda x, i=i: x[i].item(), FloatType())
    out = out.withColumn('col_{}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(n)]).show()