TL;DR: There's no easy way to do what you're asking unless you know the possible keys ahead of time.
Let me use an example to explain why, and what your options are.
First, create the following DataFrame:
data = [({'a': 1, 'b': 2},), ({'c': 3},), ({'a': 4, 'c': 5},)]
df = spark.createDataFrame(data, ["a"])
df.show()
#+-------------------+
#|                  a|
#+-------------------+
#|Map(a -> 1, b -> 2)|
#|        Map(c -> 3)|
#|Map(a -> 4, c -> 5)|
#+-------------------+
which has the following schema:
df.printSchema()
#root
# |-- a: map (nullable = true)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
Can MapType be cast as StructType?
The simple answer is no (at least not efficiently), unless you know the keys ahead of time.
The difference between MapType and StructType is that the key-value pairs of a map are row-wise independent: each row can carry a different set of keys. That is not the case for a StructType column; in a struct column, all of the rows have the same struct fields. For this reason, Spark can't easily infer which columns to create from the map. (Remember that Spark operates on each row in parallel.) Exploding a struct into columns, on the other hand, is straightforward because all of the columns are known ahead of time.
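To see that row-wise independence concretely, here is a small sketch using map_keys (available in the Python API from Spark 2.3; the key order inside each array may differ on your end):

import pyspark.sql.functions as f

# Each row of the map column carries its own key set.
df.select(f.map_keys("a").alias("keys")).show()
#+------+
#|  keys|
#+------+
#|[a, b]|
#|   [c]|
#|[a, c]|
#+------+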
So, if you know the keys, you can build the struct column yourself:
import pyspark.sql.functions as f

df_new = df.select(
    f.struct(*[f.col("a").getItem(c).alias(c) for c in ["a", "b", "c"]]).alias("a")
)
df_new.show()
#+-------------+
#|            a|
#+-------------+
#|   [1,2,null]|
#|[null,null,3]|
#|   [4,null,5]|
#+-------------+
And the new schema is:
df_new.printSchema()
#root
# |-- a: struct (nullable = false)
# | |-- a: long (nullable = true)
# | |-- b: long (nullable = true)
# | |-- c: long (nullable = true)
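And because the struct fields are fixed, expanding them into top-level columns is a one-liner:

df_new.select("a.*").show()
#+----+----+----+
#|   a|   b|   c|
#+----+----+----+
#|   1|   2|null|
#|null|null|   3|
#|   4|null|   5|
#+----+----+----+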
Can we query sub-keys directly from MapType?
Yes, as shown above: you can use getItem(), which gets an element out of an array by index or out of a map by key.
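For example, pulling a single key out of the original map column (df["a"]["b"] is equivalent shorthand):

df.select(f.col("a").getItem("b").alias("b")).show()
#+----+
#|   b|
#+----+
#|   2|
#|null|
#|null|
#+----+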
If you don't know the keys, your only option is to explode the map into rows, groupby, and pivot.
df.withColumn("id", f.monotonically_increasing_id())\
.select("id", f.explode("a"))\
.groupby("id")\
.pivot("key")\
.agg(f.first("value"))\
.drop("id")\
.show()
#+----+----+----+
#| a| b| c|
#+----+----+----+
#|null|null| 3|
#| 1| 2|null|
#| 4|null| 5|
#+----+----+----+
In this case, we need to create an id column first so that there's something to group by. The pivot here can be expensive, depending on the size of your data.
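One mitigation, if you can enumerate the keys (even a superset of them): pivot() accepts an explicit list of values, which lets Spark skip the extra job it would otherwise run to collect the distinct keys. A sketch, assuming the keys are among ["a", "b", "c"]:

df.withColumn("id", f.monotonically_increasing_id())\
    .select("id", f.explode("a"))\
    .groupby("id")\
    .pivot("key", ["a", "b", "c"])\
    .agg(f.first("value"))\
    .drop("id")\
    .show()

This produces the same result as above, minus the distinct-key scan.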