I'm attempting to convert a PipelinedRDD in PySpark to a DataFrame. This is the code snippet:

newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()

When I run the code though, I receive this error:

'list' object has no attribute 'encode'

I've tried multiple other combinations, such as converting it to a Pandas dataframe using:

newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()

But then I end up receiving this error:

AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'

Any help would be greatly appreciated. Thank you for your time.

1 Answer

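rdd.toDF() only becomes available after a SparkSession has been created (in Spark 1.x, after a SQLContext has been created). toPandas() is a method of DataFrame, not of RDD, so it has to be called on the result of toDF().

To fix your code, try the following: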

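from pyspark.sql import SparkSession

# Creating the session is what makes toDF() available on RDDs
spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.textFile(...)
newRDD = rdd.map(...)
df = newRDD.toDF()
pandas_df = df.toPandas()  # toPandas() is called on the DataFrame, not the RDD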
Zhang Tong
SparkSession is not available in Spark 1.6; it only became available in Spark 2.0, and I cannot upgrade to Spark 2.0. – Jul 07 '17 at 16:39