
I have a list of dictionaries, say:

 list_ = [
  {u'column1': u'test1', u'column2': u'None'},
  {u'added_column1': u'test2', u'column2': u'None'}]

The first row has two columns: column1 and column2.

The second row has two columns: added_column1 and column2.

I want to create a Spark DataFrame from this data, and the DataFrame should change as the list changes (new keys should become new columns).

Is there any long-term solution?

Currently I use:

 spark.createDataFrame(list_).show()

This works, but I get this warning:

 UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead

  • They're actually 2 questions: keep spark dataframe sync with dict, and create a dataframe from a dict (which is a duplicate of [this question](https://stackoverflow.com/questions/52238803/how-to-convert-list-of-dictionaries-into-spark-dataframe)) – knh190 Apr 26 '19 at 22:54
  • The link you shared does not have the solution to my problem. How can I create a unified data frame with varying dictionaries? – User_99999 Apr 26 '19 at 22:58
  • I fixed the link. I mistook your list of dicts for a dict. – knh190 Apr 26 '19 at 22:59
  • The new link does not solve my problem :(. I'm looking for a solution in which new column(s) are added based on the list of dictionaries. The solution in the link creates only two columns – User_99999 Apr 26 '19 at 23:03
  • 1
    Then you're even more complicating the question! But a new column can be added using `WithColumn`, I'm sure you can search a bunch of related questions. BTW you were creating and not appending to an dataframe in your original post. – knh190 Apr 26 '19 at 23:06
  • Well, I am having a hard time explaining my requirement. The dataframe created with the above list should contain 3 columns. I have a short term solution, but it is deprecated. I am looking for a better solution – User_99999 Apr 26 '19 at 23:09

1 Answer

You can use the toDF() function on an RDD and specify sampleRatio, the fraction of rows to sample when inferring the schema during conversion to a DataFrame.

list_ = [
 {u'column1': u'test1', u'column2': u'None'},
 {u'added_column1': u'test2', u'column2': u'None'}]

sc.parallelize(list_).toDF(sampleRatio=0.9).show()

Creating a DataFrame from Rows (built from the dicts) requires that all rows have the same set of columns:

spark.createDataFrame(list(map(lambda x: Row(**x), list_))).show()

The above code will give you this error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.

Manoj Singh