
I have a data frame roughly like this:

import pandas as pd

data = [
    {'user_id': 1, 'week': 1, 'score': 1},
    {'user_id': 1, 'week': 2, 'score': 2},
    {'user_id': 1, 'week': 2, 'score': 3},
    {'user_id': 2, 'week': 1, 'score': 1},
    {'user_id': 2, 'week': 1, 'score': 1}]
df = pd.DataFrame(data)
+---------+------+-------+
| user_id | week | score |
+---------+------+-------+
|       1 |    1 |     1 |
|       1 |    2 |     2 |
|       1 |    2 |     3 |
|       2 |    1 |     1 |
|       2 |    1 |     1 |
+---------+------+-------+

I want to group this by user_id and week, but then take each score in each group and pivot it into a new column, so that the resulting data frame looks like this:

+---------+------+--------+--------+
| user_id | week | score1 | score2 |
+---------+------+--------+--------+
|       1 |    1 |      1 |        |
|       1 |    2 |      2 |      3 |
|       2 |    1 |      1 |      1 |
+---------+------+--------+--------+

The group-by is straightforward,

df.groupby(['user_id', 'week'], as_index=False)

but I cannot see how to do the reshaping.

– dumbledad

2 Answers


You can use groupby.cumcount() with assign(), set_index(), and unstack():

m = (df.assign(k=df.groupby(['user_id', 'week']).cumcount())
       .set_index(['user_id', 'week', 'k'])
       .unstack())
m.columns = [f'{a}_{b}' for a, b in m.columns]
print(m.reset_index())

   user_id  week  score_0  score_1
0        1     1      1.0      NaN
1        1     2      2.0      3.0
2        2     1      1.0      1.0
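
The key is the cumcount: it numbers each row within its (user_id, week) group, and that running index becomes the new column level after unstack(). A quick check on the question's df (an untested sketch; the printed output is what pandas typically shows):

print(df.groupby(['user_id', 'week']).cumcount())

0    0
1    0
2    1
3    0
4    1
dtype: int64

If you want the question's exact score1/score2 column names, flattening with m.columns = [f'{a}{b + 1}' for a, b in m.columns] should work (an assumed one-line tweak, not from the original answer).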
– anky
  • Thanks, it looks great. I've not used assign so I'll need to digest this! – dumbledad Nov 06 '19 at 17:06
  • @dumbledad it is similar to assigning a column `k`, except that it returns a copy and does not assign in place, so we can chain it with other functions :) – anky Nov 06 '19 at 17:07
  • I am torn about which of these to accept as the answer. I do find @ansev's [answer](https://stackoverflow.com/a/58735532/575530) more straightforward to read, but his takes 11:10 to run on the real data and yours takes 00:06, that's over 11 minutes faster! – dumbledad Nov 08 '19 at 11:44
  • @dumbledad that is because of `apply`, which is essentially a for loop and takes time. Good and easy to read for smaller datasets, but not the same for bigger ones. :) – anky Nov 08 '19 at 11:49
  • anky_91's answer is much better. I was really just experimenting here. That's why I started my answer with "we can also ...". In general the apply method is slow and it is best to avoid it. – ansev Nov 08 '19 at 11:49
  • One other unexpected difference between anky_91's vectorised solution and ansev's is that the faster vectorised one throws an "Unstacked DataFrame is too big, causing int32 overflow" ValueError for some of my data, while the looped (apply) version does not. I've added a try/except to swap when there's an overflow (sketched below). – dumbledad Nov 11 '19 at 13:57
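
A minimal sketch of that fallback, wrapping the fast unstack() path in a try/except and dropping to the apply() version on overflow (the function name widen_scores and the exact exception handling are assumptions, not code from the thread):

import pandas as pd

def widen_scores(df):
    try:
        # Fast vectorised path; can raise ValueError
        # ("Unstacked DataFrame is too big, causing int32 overflow")
        # on very large inputs.
        m = (df.assign(k=df.groupby(['user_id', 'week']).cumcount())
               .set_index(['user_id', 'week', 'k'])
               .unstack())
        m.columns = [f'{a}_{b}' for a, b in m.columns]
        return m.reset_index()
    except ValueError:
        # Slower apply-based fallback from the second answer.
        return (df.groupby(['user_id', 'week'])
                  .score
                  .apply(list)
                  .apply(pd.Series)
                  .add_prefix('score_')
                  .reset_index())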

We can also use groupby + apply(list) and apply(pd.Series):

new_df = (df.groupby(['user_id', 'week'])
            .score
            .apply(list)
            .apply(pd.Series)
            .add_prefix('score_')
            .reset_index())
print(new_df)

   user_id  week  score_0  score_1
0        1     1      1.0      NaN
1        1     2      2.0      3.0
2        2     1      1.0      1.0
– ansev
  • I really need a tattoo that reminds me that pandas `list` can be used as an aggregate function, it was key to [another answer](https://stackoverflow.com/a/57055945/575530) that really helped me. List, list, list, list, list, list, list, ... – dumbledad Nov 07 '19 at 10:49
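
For reference, a minimal demonstration of list as a groupby aggregation on the question's df (an untested sketch; the output is what pandas typically prints):

print(df.groupby(['user_id', 'week'])['score'].agg(list))

user_id  week
1        1          [1]
         2       [2, 3]
2        1       [1, 1]
Name: score, dtype: object

Each group's scores are collected into a Python list, which is exactly the intermediate that .apply(pd.Series) spreads into columns in the answer above.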