
I have an RDD as below:

rdd_1 = ['"columns":["date","appname","appenv","appstate"]']

I want to convert it to a dataframe like below

+---------+
| columns |
+---------+
|date     |
|appname  |
|appenv   |
|appstate |
+---------+

What I tried: I tried to create a schema as below and use it to create the dataframe, but that did not work.

rdd_1_schema = StructType(
    [
        StructField('columns',ArrayType(StringType()))
    ])

The error output with the schema is below

rdd_1.toDF(schema=rdd_1_schema).show()

Error:

TypeError: StructType can not accept object '"columns": in type <type 'str'>
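The message is consistent with each element of the RDD being a plain `str`, whereas a `StructType` schema is applied to row-like records (tuples, lists, dicts, or `Row`s). A minimal plain-Python sketch of reshaping one element, assuming each element is a JSON object fragment without its surrounding braces; `to_record` is a hypothetical helper name, and the Spark call in the trailing comment is an untested assumption:

```python
import json

def to_record(line):
    # Hypothetical helper: wrap the fragment in braces so it parses as JSON,
    # then return a one-field tuple shaped like
    # StructField('columns', ArrayType(StringType())).
    parsed = json.loads("{" + line + "}")
    return (parsed["columns"],)

element = '"columns":["date","appname","appenv","appstate"]'
record = to_record(element)

print(type(element).__name__)  # str
print(record)

# Assumption: with Spark available, rdd_1.map(to_record).toDF(schema=rdd_1_schema)
# would be expected to give a single row holding the whole array.
```

This would still give one row containing an array rather than one row per name, so it only addresses the type error, not the target layout.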

2nd try: I tried using flatMap

rdd_1.flatMap(lambda x: map(lambda e: (x[0], e), x[1])).toDF().show()

but it treats each string as a list of individual characters, as in the output below:

+---+---+
| _1| _2|
+---+---+
| ''|  c|
+---+---+
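That kind of output is what one would expect if each element is a single string: indexing a `str` yields characters, so `x[0]` and `x[1]` are the first two characters of the line rather than a key and a list. A sketch of one way to get one row per column name instead, in plain Python; `parse_columns` is a hypothetical helper, and the Spark line in the final comment assumes `rdd_1` is an RDD of such strings:

```python
import json

line = '"columns":["date","appname","appenv","appstate"]'

# Indexing a str gives single characters, which is what the
# flatMap attempt ends up iterating over.
first, second = line[0], line[1]

def parse_columns(text):
    # Hypothetical helper: parse the fragment as JSON and emit one
    # 1-tuple per column name, so flatMap would yield one row each.
    names = json.loads("{" + text + "}")["columns"]
    return [(name,) for name in names]

rows = parse_columns(line)
print(repr(first), repr(second))
print(rows)

# Assumed Spark usage (not run here):
# rdd_1.flatMap(parse_columns).toDF(["columns"]).show()
```

Returning 1-tuples rather than bare strings matters here, since a DataFrame row needs a record-like value even when it has a single column.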
Deb
  • Hi there, so you are trying to create a new schema from the list `["date","appname","appenv","appstate"]`? – abiratsis Sep 02 '19 at 20:07
  • @Alexandros Biratsis No, I am trying to create the dataframe from that RDD. I tried supplying a schema, but that is not working. I have provided those details in the question. – Deb Sep 03 '19 at 05:33
  • The toDF function does not expect a schema but a string list containing the column names – abiratsis Sep 03 '19 at 11:17
  • Possible duplicate of [Spark RDD to DataFrame python](https://stackoverflow.com/questions/39699107/spark-rdd-to-dataframe-python) – Habardeen Sep 03 '19 at 13:15
  • It is different from that. Though it is also about creating a DataFrame, the problem here is different – Deb Sep 03 '19 at 13:24
  • If I understand correctly, you want to create a spark dataframe from an RDD which has a single column with string values in it and this RDD comes from a dictionary? – Habardeen Sep 03 '19 at 13:26
  • An RDD does not have columns. It is a key-value pair with a list as the value, which needs to be converted to Rows, and the RDD does not return the key correctly, though visually it looks like a key-value pair – Deb Sep 03 '19 at 13:27
  • Your question is not clear. Provide an example of what you are trying to get. Looks like rdd_1 is the logical schema of your dataframe and not the rdd itself. An RDD contains some data and does have columns indeed! See here: https://runawayhorse001.github.io/LearningApacheSpark/rdd.html – Habardeen Sep 03 '19 at 13:35
