
If I have a list of dictionaries that looks something like this:

list = [{'a': 1, 'b': 2, 'c': 3}, {'b': 4, 'c': 5, 'd': 6, 'e': 7}]

How can I convert the list to a Spark dataframe without dropping any keys that may not be shared between the dictionaries? For example, if I use sc.parallelize(list).toDF(), the resulting dataframe has columns 'a', 'b', and 'c', with column 'a' being null for the second dictionary, and columns 'd' and 'e' from the second dictionary are dropped completely.

From playing around with the order of the dictionaries, I see that it defers to the keys in the dictionary that appears first in the list, so if I were to swap the dictionaries in my example above, my resulting dataframe would have columns 'b', 'c', 'd', and 'e'.

In reality, there will be far more than two dictionaries in this list, and there will be no guarantee that the keys will be the same from dictionary to dictionary, so it's important that I find a reliable way to handle potentially different keys.
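(To get a feel for how much the keys actually vary, a quick pure-Python pass over the list can report the union and intersection of keys. This is just a sanity-check sketch using the example list above, not part of Spark.)

```python
# Sketch: inspect how keys differ across a list of dicts (pure Python)
records = [{'a': 1, 'b': 2, 'c': 3}, {'b': 4, 'c': 5, 'd': 6, 'e': 7}]

all_keys = set().union(*records)                    # keys appearing in any dict
shared_keys = set.intersection(*map(set, records))  # keys appearing in every dict

print(sorted(all_keys))     # ['a', 'b', 'c', 'd', 'e']
print(sorted(shared_keys))  # ['b', 'c']
```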

Eric J
  • what should the output data frame look like? – Vamsi Prabhala Feb 27 '20 at 02:01
  • In my example above, the row corresponding to the first dictionary should have null/na values for columns 'd' and 'e', and the row corresponding to the second dictionary should have a null/na value for column 'a'. All other keys are shared, so they should appear in the appropriate column based on key. – Eric J Feb 27 '20 at 02:07
  • Does this answer your question? [How to convert list of dictionaries into Pyspark DataFrame](https://stackoverflow.com/questions/52238803/how-to-convert-list-of-dictionaries-into-pyspark-dataframe) – AMC Feb 27 '20 at 02:27
  • I think this is an exact duplicate of the above. – AMC Feb 27 '20 at 02:27
  • Not quite. I came across the above earlier, but it suggests using sc.parallelize which doesn't return the desired dataframe when the dictionaries are different sizes. – Eric J Feb 27 '20 at 03:06

1 Answer


You can pass the list of dictionaries to the createDataFrame function.

l = [{'a': 1, 'b': 2, 'c': 3}, {'b': 4, 'c': 5, 'd': 6, 'e': 7}]
df = spark.createDataFrame(l)
# UserWarning: inferring schema from dict is deprecated,
# please use pyspark.sql.Row instead
df.show()

+----+---+---+----+----+
|   a|  b|  c|   d|   e|
+----+---+---+----+----+
|   1|  2|  3|null|null|
|null|  4|  5|   6|   7|
+----+---+---+----+----+

You should also provide a schema for the columns, since schema inference from dictionaries is deprecated. Note that using Row objects to create a dataframe requires all the dictionaries to have the same columns.
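(If you do want the Row route, one workaround is to first normalize every dict to the full key set, filling missing keys with None. This is a sketch, not part of the original answer; it assumes a SparkSession named `spark` exists for the commented-out lines.)

```python
# Sketch: pad each dict with None for missing keys so all rows share columns
records = [{'a': 1, 'b': 2, 'c': 3}, {'b': 4, 'c': 5, 'd': 6, 'e': 7}]

all_keys = sorted(set().union(*records))
normalized = [{k: d.get(k) for k in all_keys} for d in records]
print(normalized[0])  # {'a': 1, 'b': 2, 'c': 3, 'd': None, 'e': None}

# from pyspark.sql import Row
# df = spark.createDataFrame([Row(**d) for d in normalized])
```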

Instead, you can define the schema programmatically by merging the keys from all the dictionaries involved:

from pyspark.sql.types import StructType, IntegerType

# Collect the union of keys across all the dicts
def merge_keys(*dict_args):
    result = set()
    for dict_arg in dict_args:
        for key in dict_arg.keys():
            result.add(key)
    return sorted(result)

# Generate a schema from a list of column names
def generate_schema(columns):
    result = StructType()
    for column in columns:
        result.add(column, IntegerType(), nullable=True)  # change type and nullability as needed
    return result

df = spark.createDataFrame(l, schema=generate_schema(merge_keys(*l)))
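Since merge_keys is plain Python, you can check it without a Spark session; on the example list it should return the sorted union of all keys:

```python
# Same helper as above, repeated so this snippet runs standalone
def merge_keys(*dict_args):
    result = set()
    for dict_arg in dict_args:
        for key in dict_arg.keys():
            result.add(key)
    return sorted(result)

l = [{'a': 1, 'b': 2, 'c': 3}, {'b': 4, 'c': 5, 'd': 6, 'e': 7}]
print(merge_keys(*l))  # ['a', 'b', 'c', 'd', 'e']
```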
Vamsi Prabhala
  • @Vamsi Prabhala, I have a much simpler question but similar logic. Thanks in advance! https://stackoverflow.com/questions/62318004/pyspark-dealing-w-dict-with-no-values – jgtrz Jun 11 '20 at 17:24