
Using pyspark 2.1, I am trying to compute some predicted results; the code is given below:

restultSet=testingData.map(lambda p: (p.label, linearModel.predict(p.features))).collect()

The output of restultSet is a list that looks like this:

[(2.0, array([ 2.09078012])), (2.0, array([ 2.09078012])), (2.0, array([ 2.09078012])), (1.0, array([ 2.09078012])), (2.0, array([ 2.09078012])), (1.0, array([ 2.09078012]))]

When I check type(restultSet), it reports a list.

I am struggling to convert this list to a dataframe.

I tried the snippet below, but it didn't work. Please help:

restultSet.toDF()

1 Answer

You cannot convert restultSet to a Spark dataframe because, due to collect, it is a Python list, and toDF works for RDDs.
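As a side note (not part of the original answer): if you prefer to work with the already-collected list, one alternative sketch is to unwrap each NumPy array into a plain float and pass the result to spark.createDataFrame. The column names here are arbitrary, chosen just for illustration:

```python
import numpy as np

# the collected list from the question (sample values)
restultSet = [(2.0, np.array([2.09078012])), (1.0, np.array([2.09078012]))]

# createDataFrame does not accept NumPy arrays directly, so unwrap to plain floats
rows = [(label, float(pred[0])) for label, pred in restultSet]

# then, with an active SparkSession:
# df = spark.createDataFrame(rows, ["label", "prediction"])
```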

Removing collect, and adding one more map to convert your NumPy arrays to Vectors.dense, should do the job.

Here is an example with the data you have provided:

import numpy as np
from pyspark.ml.linalg import Vectors

# your data as an RDD (i.e. before 'collect')
dd = sc.parallelize([(2.0, np.array([ 2.09078012])), (2.0, np.array([ 2.09078012])), (2.0, np.array([ 2.09078012])), (1.0, np.array([ 2.09078012])), (2.0, np.array([ 2.09078012])), (1.0, np.array([ 2.09078012]))])
dd.take(1)
# [(2.0, array([ 2.09078012]))]

df = dd.map(lambda x: (x[0], Vectors.dense(x[1]))).toDF()
df.show()
# +---+------------+ 
# | _1|          _2|
# +---+------------+
# |2.0|[2.09078012]| 
# |2.0|[2.09078012]|
# |2.0|[2.09078012]|
# |1.0|[2.09078012]|
# |2.0|[2.09078012]|
# |1.0|[2.09078012]|
# +---+------------+

To give names to the resulting columns, include them as a list argument in toDF, i.e. toDF(["column_1", "column_2"]).

My intention was `.toDF` on a `List` (not on an RDD), since that comes only with implicits in Scala. Anyway, thanks for pointing it out. – mrsrinivas Oct 19 '17 at 12:04