
I want to convert my list of dictionaries into a DataFrame. This is the list:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

This is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())

df.show(3, False)

+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{type_activity_id=1,type_activity_name=xxx}|
|{type_activity_id=2,type_activity_name=yyy}|
|{type_activity_id=3,type_activity_name=zzz}|
+-------------------------------------------+

I assume that I should provide some mapping and types for each column, but I don't know how to do it.
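For what it's worth, one way to provide that mapping (a sketch; the Spark calls are shown as comments since they need a running SparkSession named `spark`) is to fix a column order, convert each dict to a tuple, and pass an explicit schema:

```python
# The list of dictionaries from the question.
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Fix a column order, then convert each dict to a tuple in that order
# so the values line up with the schema fields.
columns = ["type_activity_id", "type_activity_name"]
rows = [tuple(d[c] for c in columns) for d in mylist]
# rows == [(1, 'xxx'), (2, 'yyy'), (3, 'zzz')]

# With a SparkSession `spark` available, the schema and DataFrame would be:
# from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# schema = StructType([StructField("type_activity_id", IntegerType()),
#                      StructField("type_activity_name", StringType())])
# df = spark.createDataFrame(rows, schema)
```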

Update:

I also tried this:

from pyspark.sql.functions import from_json
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               IntegerType, StringType)

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
| null|
+-----+
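A likely reason for the nulls (my reading, sketched here without a Spark session): `from_json` returns null whenever the string is not valid JSON, and the stringified rows use Java-style `key=value` pairs with unquoted keys. Serializing each dict with `json.dumps` first would give `from_json` something it can parse (and the schema would then be a plain `StructType`, not `ArrayType`, since each row holds a single object):

```python
import json

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Roughly what ends up in the string column above: not valid JSON.
bad = "{type_activity_id=1,type_activity_name=xxx}"
try:
    json.loads(bad)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
# is_valid_json is False, which is why from_json yields null for every row.

# Proper JSON strings round-trip cleanly:
good = [json.dumps(d) for d in mylist]
# e.g. spark.createDataFrame(good, StringType()) followed by from_json with a
# StructType schema would parse these.
```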
Markus

4 Answers


In the past, you could simply pass a list of dictionaries to spark.createDataFrame(), but this is now deprecated:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,"

As this warning message says, you should use pyspark.sql.Row instead.

from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+

Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.

pault
  • Thanks. Do you know why it was deprecated? – Markus Sep 10 '18 at 15:41
  • I am not sure why. As an aside, this is probably faster than converting to/from json. – pault Sep 10 '18 at 15:44
  • But this may not work when the structure of each dictionary (array element) is not the same. – Adiga Jun 28 '20 at 09:31
  • Using the `spark.createDataFrame(Row(**x) for x in mylist)` method in PySpark 3.0.0, I'm getting downstream issues where values are placed in the wrong columns. Possibly related to https://issues.apache.org/jira/browse/SPARK-26200 – Daniel Himmelstein Aug 20 '20 at 19:30
  • How do you make sure the values in the dict are of the correct type, or typecast them if necessary? – Gadam Dec 10 '20 at 20:04
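As the comments note, `Row(**x)` assumes every dict has the same keys (and sensible types). A sketch (pure Python, no Spark session; the ragged input and the per-column casts are illustrative) that normalizes the dicts and coerces types before building Rows:

```python
raw = [
    {"type_activity_id": "1", "type_activity_name": "xxx"},
    {"type_activity_id": 2},  # missing a key, and the ids mix str/int
]

# Collect the union of keys so every Row gets the same fields.
all_keys = sorted({k for d in raw for k in d})

# Illustrative per-column casts; adjust to your actual schema.
casts = {"type_activity_id": int, "type_activity_name": str}

def normalize(d):
    out = {}
    for k in all_keys:
        v = d.get(k)
        out[k] = casts[k](v) if v is not None else None
    return out

normalized = [normalize(d) for d in raw]
# normalized == [{'type_activity_id': 1, 'type_activity_name': 'xxx'},
#                {'type_activity_id': 2, 'type_activity_name': None}]
# Then: spark.createDataFrame(Row(**d) for d in normalized)
```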

You can do it like this; you will get a DataFrame with two columns.

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)

Output :

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
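If your Spark version warns about (or refuses) an RDD of plain dicts, a safe variant is to serialize each dict to a JSON string first (a sketch; only the `json.dumps` step runs here, the Spark calls are commented since they need a context):

```python
import json

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# One JSON document per element; read.json expects JSON strings, not dicts.
as_json = [json.dumps(d) for d in mylist]

# With a SparkContext/SQLContext available:
# myDf = sqlContext.read.json(sc.parallelize(as_json))
```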
pissall

In Spark 2.4 it is possible to do this directly with df = spark.createDataFrame(mylist):

>>> mylist = [
...   {"type_activity_id":1,"type_activity_name":"xxx"},
...   {"type_activity_id":2,"type_activity_name":"yyy"},
...   {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
anvy elizabeth
  • It still gives me this warning though `UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead` – Adiga Jun 28 '20 at 04:55

I was also facing the same issue when creating a DataFrame from a list of dictionaries. I resolved it using namedtuple.

Below is my code using the data provided.

from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
          {"type_activity_id":2,"type_activity_name":"yyy"}, 
          {"type_activity_id":3,"type_activity_name":"zzz"}
         ]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])

for my_dict in mylist:
    namedtupleobj = ExampleTuple(**my_dict)
    final_list.append(namedtupleobj)

sqlContext.createDataFrame(final_list).show(truncate=False)

output

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1               |xxx               |
|2               |yyy               |
|3               |zzz               |
+----------------+------------------+

My version information is as follows:

spark: 2.4.0
python: 3.6

It is not necessary to have the mylist variable; since it was available, I used it to create the namedtuple objects, but they can also be created directly.
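To illustrate that last point, the namedtuples can be built directly, positionally or by keyword, without the intermediate dicts (a sketch; the createDataFrame call is commented since it needs a SQLContext):

```python
from collections import namedtuple

ExampleTuple = namedtuple("ExampleTuple",
                          ["type_activity_id", "type_activity_name"])

# Created directly, no dict needed: positional or keyword arguments both work.
final_list = [
    ExampleTuple(1, "xxx"),
    ExampleTuple(2, "yyy"),
    ExampleTuple(type_activity_id=3, type_activity_name="zzz"),
]
# final_list[0].type_activity_id == 1
# sqlContext.createDataFrame(final_list).show(truncate=False)
```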

Athar