
I want to create a new DataFrame from an existing DataFrame in PySpark. The DataFrame `df` contains a column named `data` whose rows are dictionaries stored as strings (the column's schema is string). The keys of each dictionary are not fixed: for example, `name` and `address` are the keys of the first row's dictionary, but other rows may have different keys. The following is an example:

 data
 -------------------------------------------------------
 {"name": "sam", "address": "uk"}
 {"name": "jack", "address": "aus", "occupation": "job"}

How do I convert this into a DataFrame with individual columns, like the following?

 name   address   occupation
 sam    uk
 jack   aus       job
amol desai
  • Possible duplicate of [How to convert list of dictionaries into Spark DataFrame](https://stackoverflow.com/questions/52238803/how-to-convert-list-of-dictionaries-into-spark-dataframe) – pault Nov 09 '18 at 12:34
  • Or a dupe of [Pyspark: explode json in column to multiple columns](https://stackoverflow.com/questions/51070251/pyspark-explode-json-in-column-to-multiple-columns?noredirect=1&lq=1). It's hard to tell from your question – pault Nov 09 '18 at 12:44
  • @pault It's not a duplicate of either of those links; I referred to them before asking. The question is properly understood: the DataFrame `df` has a column named `data` which contains rows of dictionaries, not a list of dictionaries. – amol desai Nov 11 '18 at 05:20
  • Your question is still unclear. You can't have "rows of dictionaries" in a pyspark DataFrame. Is `df` a pandas DataFrame? Or is the `data` column actually of type `StringType()` or `MapType()`? [Edit] your question with the output of `df.select('data').printSchema()`. Better yet, provide a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). Maybe you're looking for [this answer](https://stackoverflow.com/a/50685590/5858851). – pault Nov 13 '18 at 15:55

2 Answers


Convert `data` to an RDD of JSON strings, then use `spark.read.json` to read the RDD into a DataFrame with an inferred schema.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"}
]

# serialize each dict to JSON text explicitly, rather than relying on
# Python's dict repr, so spark.read.json can always parse it
df = spark.read.json(sc.parallelize(data).map(json.dumps)).na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|     uk| sam|          |
|    aus|jack|       job|
+-------+----+----------+
cs95
  • I have tried this method; it's giving `py4j.Py4JException: Method __getnewargs__([]) does not exist`. `data` is a column name of DataFrame `df`. – amol desai Nov 09 '18 at 04:56

If the order of rows is not important, this is another way you can do this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# toDF() on an RDD of dicts infers the schema (deprecated in favour of
# Row objects, but it works); keys missing from a row come back as null
df = sc.parallelize([
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]).toDF()

df = df.na.fill('')

df.show()

+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|    aus|jack|       job|
|     uk| sam|          |
+-------+----+----------+
Ali AzG