Pyspark (1.6.1) SQL.dataframe column to vector aggregation without Hive

Asked May 08 '16 at 12:13

Suppose my SQL dataframe df is like this:

| id | v1 | v2 |
|----+----+----|
|  1 |  0 |  3 |
|  1 |  0 |  3 |
|  1 |  0 |  8 |
|  4 |  1 |  2 |

I want the output to be:

| id | v1  | list(v2) |
|----+-----+----------|
|  1 | [0] | [3,3,8]  |
|  4 | [1] | [2]      |

What is the simplest way to do this with a Spark SQL DataFrame, without Hive?

1) Apparently, with Hive support one could simply use the collect_set() and collect_list() aggregate functions, but these do not work with a plain Spark SQLContext.

2) Another way would be to write a UDAF, but given the amount of code needed, that seems like overkill for such a simple aggregation.

3) I could drop down to df.rdd and use the groupBy() function there. This is my last resort. I converted the RDD to a DataFrame precisely to make data manipulation easier, but apparently it does not help here...

Are there any other simple ways that I missed?
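
For reference, option 3 could be sketched roughly as below. This is only a minimal sketch, not a tested solution: it uses aggregateByKey() instead of groupBy(), the names append_value, merge_lists, collect_v2_lists and v2_list are made up for illustration, and it assumes an SQLContext has already been created so that toDF() is available on RDDs. It also keeps v1 as a plain grouping value rather than a one-element list, and the order of values inside each list is not guaranteed.

```python
def append_value(acc, value):
    """Fold one v2 value into the accumulator list of a single partition."""
    acc.append(value)
    return acc


def merge_lists(left, right):
    """Merge the accumulator lists built on two different partitions."""
    left.extend(right)
    return left


def collect_v2_lists(df):
    """Group df by (id, v1) and gather v2 into a list using RDD operations only.

    Assumes df has columns named id, v1 and v2, as in the example table.
    """
    # Key every row by (id, v1) and keep v2 as the value.
    pairs = df.rdd.map(lambda row: ((row.id, row.v1), row.v2))
    # Start each key with an empty list, then append and merge per key.
    grouped = pairs.aggregateByKey([], append_value, merge_lists)
    # Flatten ((id, v1), [v2, ...]) back into three columns.
    flat = grouped.map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
    return flat.toDF(["id", "v1", "v2_list"])
```

Calling collect_v2_lists(df).show() on the example data should then give one row per (id, v1) pair, but that is still the RDD detour I was hoping to avoid.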

asked May 08 '16 at 12:13

Davor Josipovic

What is wrong with HiveContext? – zero323 May 08 '16 at 12:18
@zero323 Not configured. – Davor Josipovic May 08 '16 at 12:19
Is that an issue? For many applications a local Derby metastore should work just fine. If not, you can use http://stackoverflow.com/a/32750733/1560062 and http://stackoverflow.com/q/33233737/1560062, replacing the final step. – zero323 May 08 '16 at 12:21
