
I have a Spark DataFrame (oldDF) that looks like this:

Id     | Category | Count
898989 | 5        | 12
676767 | 12       | 1
334344 | 3        | 2
676767 | 13       | 3

I want to create a new DataFrame whose columns are the Category values and whose cell values are the Counts, grouped by Id.

The reason I can't (or would rather not) specify a schema up front is that the categories change a lot. Is there any way to do it dynamically?

The output I would like to see as a DataFrame, built from the one above:

Id     | V3 | V5 | V12 | V13
898989 | 0  | 12 | 0   | 0
676767 | 0  | 0  | 1   | 3
334344 | 2  | 0  | 0   | 0
  • There is a typo in your code, round brackets are not closed properly. – Anas Jan 11 '16 at 13:48
  • what is type of Category column? – Anas Jan 11 '16 at 14:02
  • Can you please elaborate the actual use case? What do you mean by categories change a lot? – Anas Jan 11 '16 at 14:40
  • can you please provide an example of the output DataFrame that you are looking for? you can change your oldDF and add some more data to it, and then make an example of the output DataFrame. – Rami Jan 11 '16 at 14:43
  • The categories are never the same for different models so I would have to write 30-40 different schemas as far as I understand right now. – Abdul Merzoug Jan 11 '16 at 14:59
  • I think @zero323 answered a similar question before. But I'm on my phone I can't search for it now... – eliasah Jan 11 '16 at 20:37

2 Answers


You need to do your groupBy operation first; then you can apply a pivot operation, as explained here.
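The group-then-pivot idea can be sketched without Spark. Here is a minimal plain-Python illustration of the same logic; the sample rows come from the question, everything else (variable names, the two-step structure) is illustrative:

```python
from collections import defaultdict

# (Id, Category, Count) rows from the question
rows = [
    (898989, 5, 12),
    (676767, 12, 1),
    (334344, 3, 2),
    (676767, 13, 3),
]

# Step 1: "groupBy(Id)" -- collect counts per Id, keyed by category
grouped = defaultdict(dict)
for id_, cat, cnt in rows:
    grouped[id_][cat] = grouped[id_].get(cat, 0) + cnt

# Step 2: "pivot(Category)" -- one column per distinct category,
# discovered dynamically from the data; missing values filled with 0
categories = sorted({cat for _, cat, _ in rows})
pivoted = {
    id_: [cats.get(c, 0) for c in categories]
    for id_, cats in grouped.items()
}

print(categories)       # [3, 5, 12, 13]
print(pivoted[676767])  # [0, 0, 1, 3]
```

The key point for the asker's "dynamic schema" concern: the column set is derived from the data itself, not declared in advance, which is exactly what Spark's pivot does when you don't pass an explicit list of values.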

Rami

With Spark 1.6

oldDF.groupBy("Id").pivot("Category").sum("Count")
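Two follow-up details to reach the exact output in the question: pivot names the new columns after the raw category values (3, 5, 12, ...) and leaves null where an Id has no rows for a category, so you would still fill nulls with 0 (e.g. `na.fill(0)` in Scala, `fillna(0)` in PySpark) and rename the columns. The desired `V3`/`V5` headers suggest a simple "V" prefix; generating those names is plain string work, sketched here (the prefix itself is an assumption taken from the question's sample output):

```python
# Distinct category values, as pivot would discover them from the data
categories = [3, 5, 12, 13]

# Build the renamed headers dynamically; "V" prefix assumed from the
# question's desired output
new_names = ["Id"] + ["V{}".format(c) for c in categories]

print(new_names)  # ['Id', 'V3', 'V5', 'V12', 'V13']
```

In Spark the renamed list could then be applied in one go with `toDF` (Scala: `df.toDF(newNames: _*)`; PySpark: `df.toDF(*new_names)`).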
Arnon Rotem-Gal-Oz