
I use PySpark.

Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.

I've tried the following:

output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'.

Any suggestions on how to transform a column of vectors into columns of its values?

Jacek Laskowski
Petrichor

4 Answers


I figured out the problem with the suggestion above. In PySpark, dense vectors are "simply represented as NumPy array objects", so the issue is with Python and NumPy types: a UDF declared with FloatType must return a Python float, not a numpy.float64. Adding .item() casts the numpy.float64 to a Python float.
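The type mismatch can be seen in plain NumPy, with no Spark involved (a minimal sketch):

```python
import numpy as np

# Indexing a NumPy array yields a numpy.float64, not a Python float.
v = np.array([0.25, 0.75])
print(type(v[0]))         # <class 'numpy.float64'>

# .item() converts it to the native Python type that FloatType expects.
print(type(v[0].item()))  # <class 'float'>
```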

The following code works:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))

Or to append these columns to the original dataframe:

randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Petrichor

Got the same problem; below is the code adjusted for an n-length vector. Note that i must be bound as a default argument (i=i), otherwise every UDF captures the loop variable by reference and ends up using its final value.

splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias('Column' + str(i)) for i, s in enumerate(splits)])
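The i=i binding matters because Python closures look up the loop variable late, when the lambda is called, not when it is defined. A plain-Python sketch of the pitfall:

```python
# Without a default argument, every lambda sees the loop's final value of i.
broken = [lambda v: v[i] for i in range(3)]
print([f([10, 20, 30]) for f in broken])  # [30, 30, 30]

# Binding i=i freezes the current index into each lambda.
fixed = [lambda v, i=i: v[i] for i in range(3)]
print([f([10, 20, 30]) for f in fixed])   # [10, 20, 30]
```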
Rookie Boy

You may want to use one UDF to extract the first value and another to extract the second. You can then apply the UDFs in a select call on the random forest output DataFrame. Example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
                                     split2_udf(col("probability")).alias("c2"))

This should give you a DataFrame output2 with columns c1 and c2 corresponding to the first and second values in the vector stored in the probability column.

Alberto Bonsanto
Saif Charaniya
    I tried your suggestion, but it produces an error, similar to the one mentioned here: http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class – Petrichor May 19 '16 at 17:12

I tried @Rookie Boy's loop, but the splits UDF loop didn't work for me, so I modified it a bit.

out = df
for i in range(n):
    # Bind i=i so each UDF keeps its own index despite Spark's lazy evaluation.
    splits_i = udf(lambda x, i=i: x[i].item(), FloatType())
    out = out.withColumn('col_{}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(n)]).show()