
I have a list of dictionaries, say:

 list_ = [
  {u'column1': u'test1', u'column2': u'None'},
  {u'added_column1': u'test2', u'column2': u'None'}]

The first row has two columns: column1 and column2.

The second row has two columns: added_column1 and column2.

I want to create a Spark DataFrame from this data, and the DataFrame should change as the list changes (new keys should become new columns).

Is there any long-term solution?

Currently I use:

 spark.createDataFrame(list_).show()

This works, but I get this warning:

 UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead

  • They're actually 2 questions: keep spark dataframe sync with dict, and create a dataframe from a dict (which is a duplicate of [this question](https://stackoverflow.com/questions/52238803/how-to-convert-list-of-dictionaries-into-spark-dataframe)) – knh190 Apr 26 '19 at 22:54
  • The link you shared does not have the solution to my problem. How can I create a unified data frame with varying dictionaries? – User_99999 Apr 26 '19 at 22:58
  • I fixed the link. I mistook your list of dicts for a dict. – knh190 Apr 26 '19 at 22:59
  • The new link does not solve my problem :(. I'm looking for a solution in which new column(s) are added based on the list of dictionaries. The solution in the link creates only two columns – User_99999 Apr 26 '19 at 23:03
  • 1
    Then you're even more complicating the question! But a new column can be added using `WithColumn`, I'm sure you can search a bunch of related questions. BTW you were creating and not appending to an dataframe in your original post. – knh190 Apr 26 '19 at 23:06
  • Well, I am having a hard time explaining my requirement. The dataframe created with the above list should contain 3 columns. I have a short term solution, but it is deprecated. I am looking for a better solution – User_99999 Apr 26 '19 at 23:09

1 Answer

You can use the toDF() function on an RDD and specify sampleRatio, the fraction of rows to sample when inferring the schema during conversion to a DataFrame.

list_ = [
 {u'column1': u'test1', u'column2': u'None'},
 {u'added_column1': u'test2', u'column2': u'None'}]

sc.parallelize(list_).toDF(sampleRatio=0.9).show()

Creating a DataFrame from Rows (built from the dicts) requires that all rows have the same set of columns:

spark.createDataFrame(list(map(lambda x: Row(**x), list_))).show()

The above code will give you this error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.

Manoj Singh