My data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1 | 10 | A | D
1 | 10 | B | E
2 | 25 | A | E
1 | 7 | A | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1,
  ((duration=10,[(action1=A,action2=D),(action1=B,action2=E)]),
   (duration=7,(action1=A,action2=G)))),
 (id=2,
  ((duration=25,(action1=A,action2=E))))]
And this is where I get stuck: I don't know how to do a nested group by. Any tips?
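To make the target concrete, here is a minimal pure-Python sketch of the nested grouping I have in mind. The helper name `group_by_duration` is hypothetical, and plain dicts stand in for the Spark Row objects, so this only emulates the `groupBy` step locally rather than running on Spark.

```python
from collections import defaultdict

def group_by_duration(rows):
    """Group the rows of one id by duration (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # Keep only the action columns inside each duration group
        groups[row['duration']].append((row['action1'], row['action2']))
    # List of (duration, actions) pairs, in first-seen order
    return list(groups.items())

# Plain dicts standing in for the Row objects of the sample data
sample = [
    {'id': 1, 'duration': 10, 'action1': 'A', 'action2': 'D'},
    {'id': 1, 'duration': 10, 'action1': 'B', 'action2': 'E'},
    {'id': 2, 'duration': 25, 'action1': 'A', 'action2': 'E'},
    {'id': 1, 'duration': 7, 'action1': 'A', 'action2': 'G'},
]

# Local emulation of groupBy(id) followed by mapValues(group_by_duration)
by_id = defaultdict(list)
for row in sample:
    by_id[row['id']].append(row)
nested = [(key, group_by_duration(rows)) for key, rows in by_id.items()]
```

If this is the right idea, I assume the same helper could be passed to Spark in place of `list`, i.e. `df.rdd.groupBy(lambda x: x['id']).mapValues(group_by_duration).collect()`, since Row objects can be indexed by field name, but I have not verified that this is the idiomatic way to do it.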