Convert Pipelined RDD to Dataframe in Pyspark

Question

This is how my pipelined RDD looks:

[([3.0, 12.0, 8.0, 49.0, 27.0], 7968.0),
 ([165.0, 140.0, 348.0, 615.0, 311.0], 165.0)]

I want to convert this to a DataFrame. I have tried splitting each tuple into its first element (the list in square brackets) and its second element, building a separate RDD from each, and converting those to DataFrames individually. I have also tried specifying a schema for the conversion, but neither approach has worked. Can anybody help?

Thanks!

Have you tried `myrdd.toDF()`? You can also specify column names: `myrdd.toDF(["col1", "col2"])` — pault, May 02 '18 at 14:45

score 0 · Answer 1 · answered May 02 '18 at 10:39

You need to flatten your RDD before converting to a DataFrame:

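# Append the trailing value to the list so each row becomes one flat list.
# Indexing is used because Python 3 does not allow tuple unpacking in lambda parameters.
df = rdd.map(lambda row: row[0] + [row[1]]).toDF()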

You can specify the schema argument of toDF() to get meaningful column names and/or types.
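As a rough illustration of the flattening step and of building column names for `toDF()`, here is a minimal sketch that runs on a plain Python list, with no Spark session needed. The sample rows are copied from the question; the helper `flatten` and the column names `f0`..`f4` and `label` are hypothetical and chosen only for this example.

```python
# Sample rows copied from the question: (list_of_values, trailing_value) pairs.
rows = [
    ([3.0, 12.0, 8.0, 49.0, 27.0], 7968.0),
    ([165.0, 140.0, 348.0, 615.0, 311.0], 165.0),
]


def flatten(row):
    """Turn ([v1, ..., vn], last) into the flat list [v1, ..., vn, last]."""
    values, last = row
    return values + [last]


# Same transformation that rdd.map(flatten) would apply to each element.
flat_rows = [flatten(r) for r in rows]

# Hypothetical column names: one per list entry plus one for the trailing value.
n_values = len(rows[0][0])
columns = ["f{}".format(i) for i in range(n_values)] + ["label"]

print(columns)
for r in flat_rows:
    print(r)

# With a live Spark session, the same pieces would presumably be combined as:
#   df = rdd.map(flatten).toDF(columns)
```

Each flattened row has exactly as many entries as there are column names, which is what lets a list of names be passed as the schema argument.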

ags29

This is not true. You do not have to flatten the RDD first. You can call `toDF()` directly. – pault May 02 '18 at 14:45
