
I want to convert my list of dictionaries into a DataFrame. This is the list:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

This is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())

df.show(3, False)

+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{type_activity_id=1,type_activity_name=xxx}|
|{type_activity_id=2,type_activity_name=yyy}|
|{type_activity_id=3,type_activity_name=zzz}|
+-------------------------------------------+

I assume that I should provide some mapping and types for each column, but I don't know how to do it.
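For what it's worth, one way to provide that mapping (a sketch; the Spark calls are shown as comments since they need a running SparkSession named `spark`) is to fix a column order, convert each dict to a tuple, and pass an explicit schema:

```python
# The list of dictionaries from the question.
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Fix a column order, then convert each dict to a tuple in that order
# so the values line up with the schema fields.
columns = ["type_activity_id", "type_activity_name"]
rows = [tuple(d[c] for c in columns) for d in mylist]
# rows == [(1, 'xxx'), (2, 'yyy'), (3, 'zzz')]

# With a SparkSession `spark` available, the schema and DataFrame would be:
# from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# schema = StructType([StructField("type_activity_id", IntegerType()),
#                      StructField("type_activity_name", StringType())])
# df = spark.createDataFrame(rows, schema)
```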

Update:

I also tried this:

from pyspark.sql.functions import from_json
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               IntegerType, StringType)

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
| null|
+-----+
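A likely reason for the nulls (my reading, sketched here without a Spark session): `from_json` returns null whenever the string is not valid JSON, and the stringified rows use Java-style `key=value` pairs with unquoted keys. Serializing each dict with `json.dumps` first would give `from_json` something it can parse (and the schema would then be a plain `StructType`, not `ArrayType`, since each row holds a single object):

```python
import json

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Roughly what ends up in the string column above: not valid JSON.
bad = "{type_activity_id=1,type_activity_name=xxx}"
try:
    json.loads(bad)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
# is_valid_json is False, which is why from_json yields null for every row.

# Proper JSON strings round-trip cleanly:
good = [json.dumps(d) for d in mylist]
# e.g. spark.createDataFrame(good, StringType()) followed by from_json with a
# StructType schema would parse these.
```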
Markus

4 Answers


In the past, you could simply pass a list of dictionaries to spark.createDataFrame(), but this is now deprecated:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,"

As this warning message says, you should use pyspark.sql.Row instead.

from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+

Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.

pault
  • Thanks. Do you know why it was deprecated? – Markus Sep 10 '18 at 15:41
  • I am not sure why. As an aside, this is probably faster than converting to/from json. – pault Sep 10 '18 at 15:44
  • But this may not work when the structure of each dictionary (array element) is not the same. – Adiga Jun 28 '20 at 09:31
  • Using the `spark.createDataFrame(Row(**x) for x in mylist)` method in PySpark 3.0.0, I'm getting downstream issues where values are placed in the wrong columns. Possibly related to https://issues.apache.org/jira/browse/SPARK-26200 – Daniel Himmelstein Aug 20 '20 at 19:30
  • How do you make sure the values in the dict are of the correct type, or typecast them if necessary? – Gadam Dec 10 '20 at 20:04
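As the comments note, `Row(**x)` assumes every dict has the same keys (and sensible types). A sketch (pure Python, no Spark session; the ragged input and the per-column casts are illustrative) that normalizes the dicts and coerces types before building Rows:

```python
raw = [
    {"type_activity_id": "1", "type_activity_name": "xxx"},
    {"type_activity_id": 2},  # missing a key, and the ids mix str/int
]

# Collect the union of keys so every Row gets the same fields.
all_keys = sorted({k for d in raw for k in d})

# Illustrative per-column casts; adjust to your actual schema.
casts = {"type_activity_id": int, "type_activity_name": str}

def normalize(d):
    out = {}
    for k in all_keys:
        v = d.get(k)
        out[k] = casts[k](v) if v is not None else None
    return out

normalized = [normalize(d) for d in raw]
# normalized == [{'type_activity_id': 1, 'type_activity_name': 'xxx'},
#                {'type_activity_id': 2, 'type_activity_name': None}]
# Then: spark.createDataFrame(Row(**d) for d in normalized)
```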

You can do it like this; you will get a DataFrame with two columns.

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)

Output :

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
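If your Spark version warns about (or refuses) an RDD of plain dicts, a safe variant is to serialize each dict to a JSON string first (a sketch; only the `json.dumps` step runs here, the Spark calls are commented since they need a context):

```python
import json

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# One JSON document per element; read.json expects JSON strings, not dicts.
as_json = [json.dumps(d) for d in mylist]

# With a SparkContext/SQLContext available:
# myDf = sqlContext.read.json(sc.parallelize(as_json))
```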
pissall

In Spark 2.4 it is possible to do this directly with df = spark.createDataFrame(mylist):

>>> mylist = [
...   {"type_activity_id":1,"type_activity_name":"xxx"},
...   {"type_activity_id":2,"type_activity_name":"yyy"},
...   {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
anvy elizabeth
  • It still gives me this warning though `UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead` – Adiga Jun 28 '20 at 04:55

I was also facing the same issue when creating a DataFrame from a list of dictionaries. I resolved it using namedtuple.

Below is my code using the data provided.

from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
          {"type_activity_id":2,"type_activity_name":"yyy"}, 
          {"type_activity_id":3,"type_activity_name":"zzz"}
         ]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])

for my_dict in mylist:
    namedtupleobj = ExampleTuple(**my_dict)
    final_list.append(namedtupleobj)

sqlContext.createDataFrame(final_list).show(truncate=False)

output

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1               |xxx               |
|2               |yyy               |
|3               |zzz               |
+----------------+------------------+

My version information is as follows:

spark: 2.4.0
python: 3.6

It is not necessary to have the mylist variable; since it was available, I used it to create the namedtuple objects, but they can also be created directly.
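To illustrate that last point, the namedtuples can be built directly, positionally or by keyword, without the intermediate dicts (a sketch; the createDataFrame call is commented since it needs a SQLContext):

```python
from collections import namedtuple

ExampleTuple = namedtuple("ExampleTuple",
                          ["type_activity_id", "type_activity_name"])

# Created directly, no dict needed: positional or keyword arguments both work.
final_list = [
    ExampleTuple(1, "xxx"),
    ExampleTuple(2, "yyy"),
    ExampleTuple(type_activity_id=3, type_activity_name="zzz"),
]
# final_list[0].type_activity_id == 1
# sqlContext.createDataFrame(final_list).show(truncate=False)
```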

Athar