Convert PipelinedRDD to dataframe

Question

I'm attempting to convert a pipelinedRDD in pyspark to a dataframe. This is the code snippet:

newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()

When I run the code though, I receive this error:

'list' object has no attribute 'encode'

I've tried multiple other combinations, such as converting it to a Pandas dataframe using:

newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()

But then I end up receiving this error:

AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'

Any help would be greatly appreciated. Thank you for your time.

score 0 · Answer 1 · answered Jul 07 '17 at 01:54


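rdd.toDF() is only attached to RDDs once a SparkSession has been created. toPandas() is a DataFrame method, not an RDD method, so it has to be called on the result of toDF().

To fix your code, try the following: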

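from pyspark.sql import SparkSession

# Creating the session is what makes toDF() available on RDDs
spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.textFile(...)
newRDD = rdd.map(...)
df = newRDD.toDF()
pandas_df = df.toPandas()  # toPandas() exists on DataFrames only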


Zhang Tong

SparkSession is not available in Spark 1.6. SparkSession only became available in Spark 2.0. I cannot upgrade to Spark 2.0 – Jul 07 '17 at 16:39
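
For Spark 1.6, a rough sketch of the same idea using SQLContext instead of SparkSession might look like the block below. It also unpacks the field names and values with `*`, since passing a whole list to Row() makes the list itself a field name, which is the likely source of the 'list' object has no attribute 'encode' error. The sample data, the `tag_scripts` stand-in for tagScripts, and the helper names `tagged_fields`, `tagged_values`, `add_tag` and `run_example` are all hypothetical and not from the original post.

```python
def tagged_fields(fields):
    """Field names for the new row: the original names plus 'tag'."""
    return list(fields) + ["tag"]


def tagged_values(row, tag):
    """Values for the new row: the original values plus the tag."""
    return tuple(row) + (tag,)


def tag_scripts(row):
    """Hypothetical stand-in for the asker's tagScripts()."""
    return "shell" if row.name.endswith(".sh") else "other"


def add_tag(row):
    # Imported lazily so the helpers above can be used without Spark.
    from pyspark.sql import Row
    # Row(*names) builds a row template; calling it with *values fills it in.
    # Both must be unpacked: Row(names) would treat the list as one field.
    names = tagged_fields(row.__fields__)
    return Row(*names)(*tagged_values(row, tag_scripts(row)))


def run_example():
    """Assumes a working Spark 1.6 install; pandas is needed for toPandas()."""
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="tag-example")
    sqlContext = SQLContext(sc)  # in 1.x this plays the role of SparkSession

    # Made-up sample rows, only for illustration.
    rdd = sc.parallelize([Row(name="a.py", size=10), Row(name="b.sh", size=20)])
    new_rdd = rdd.map(add_tag)

    df = sqlContext.createDataFrame(new_rdd)  # new_rdd.toDF() should also work here
    pandas_df = df.toPandas()                 # called on the DataFrame, not the RDD
    sc.stop()
    return pandas_df
```

Calling `run_example()` should return a pandas DataFrame with the original columns plus `tag`. The key points are that the SQLContext is created before the conversion and that toPandas() is called on the DataFrame rather than on the PipelinedRDD.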
