
I have a DataFrame with a MapType column where the key is an id and the value is a StructType containing two numbers, a counter and a revenue.

It looks like this:

+--------------------------------------+
| myMapColumn                          |
+--------------------------------------+
| Map(1 -> [1, 4.0], 2 -> [1, 1.5])    |
| Map()                                |
| Map(1 -> [3, 5.5])                   |
| Map(1 -> [4, 0.1], 2 -> [6, 101.56]) |
+--------------------------------------+

Now I need to sum up these two values per id, and the result should be:

+----+-------+---------+
| id | count | revenue |
+----+-------+---------+
| 1  | 8     | 9.6     |
| 2  | 7     | 103.06  |
+----+-------+---------+

I actually have no idea how to do that and could not find any documentation for this special case. I tried using DataFrame.groupBy but could not make it work :(

Any ideas?

I'm using Spark 1.5.2 with Python 2.6.6
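
For reproducing, a DataFrame with this shape can be built roughly like this (a minimal sketch; the variable names and the use of plain tuples for the struct values are just illustrative, and Spark may infer long instead of int for the whole-number fields):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# one map column per row; plain tuples are inferred as structs
# with fields _1 (the counter) and _2 (the revenue)
df = sqlContext.createDataFrame(
    [({1: (1, 4.0), 2: (1, 1.5)},),
     ({},),
     ({1: (3, 5.5)},),
     ({1: (4, 0.1), 2: (6, 101.56)},)],
    ["myMapColumn"])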


1 Answer


Assuming that the schema is equivalent to this:

root
 |-- myMapColumn: map (nullable = true)
 |    |-- key: integer
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: double (nullable = false)

all you need is explode and a simple aggregation:

from pyspark.sql.functions import col, explode, sum as sum_

(df
  .select(explode(col("myMapColumn")))      # one row per map entry, as columns `key` and `value`
  .groupBy(col("key").alias("id"))          # group by the map key (the id)
  .agg(sum_("value._1").alias("count"),     # sum the first struct field
       sum_("value._2").alias("revenue")))  # sum the second struct field
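
To see what explode does here, the intermediate result for the example data above looks roughly like this (the exact show formatting may differ):

df.select(explode(col("myMapColumn"))).show()
# +---+----------+
# |key|     value|
# +---+----------+
# |  1|   [1,4.0]|
# |  2|   [1,1.5]|
# |  1|   [3,5.5]|
# |  1|   [4,0.1]|
# |  2|[6,101.56]|
# +---+----------+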
  • Thanks a lot! explode is exactly what I was looking for, but I didn't find it myself. I always expected it to work like `df.myMapColumn.key` / `df.myMapColumn.value`, but it doesn't – mabe.berlin Mar 17 '16 at 12:06
  • @mabe.berlin You can check http://stackoverflow.com/a/33850490/1560062 as well :) – zero323 Mar 17 '16 at 12:14