
I have a data frame roughly like this:

import pandas as pd

data = [
    {'user_id': 1, 'week': 1, 'score': 1},
    {'user_id': 1, 'week': 2, 'score': 2},
    {'user_id': 1, 'week': 2, 'score': 3},
    {'user_id': 2, 'week': 1, 'score': 1},
    {'user_id': 2, 'week': 1, 'score': 1}]
df = pd.DataFrame(data)
+---------+------+-------+
| user_id | week | score |
+---------+------+-------+
|       1 |    1 |     1 |
|       1 |    2 |     2 |
|       1 |    2 |     3 |
|       2 |    1 |     1 |
|       2 |    1 |     1 |
+---------+------+-------+

I want to group this by user_id and week, but then take each score in each group and pivot it into a new column, so that the resulting data frame looks like this:

+---------+------+--------+--------+
| user_id | week | score1 | score2 |
+---------+------+--------+--------+
|       1 |    1 |      1 |        |
|       1 |    2 |      2 |      3 |
|       2 |    1 |      1 |      1 |
+---------+------+--------+--------+

The group-by is straightforward,

df.groupby(['user_id', 'week'], as_index=False)

but I cannot see how to do the reshaping.

– dumbledad

2 Answers


You can use groupby.cumcount() with assign(), set_index(), and unstack():

m = (df.assign(k=df.groupby(['user_id', 'week']).cumcount())
       .set_index(['user_id', 'week', 'k'])
       .unstack())
m.columns = [f'{a}_{b}' for a, b in m.columns]
print(m.reset_index())

   user_id  week  score_0  score_1
0        1     1      1.0      NaN
1        1     2      2.0      3.0
2        2     1      1.0      1.0
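
The key is the cumcount: it numbers each row within its (user_id, week) group, and that running index becomes the new column level after unstack(). A quick check on the question's df (an untested sketch; the printed output is what pandas typically shows):

print(df.groupby(['user_id', 'week']).cumcount())

0    0
1    0
2    1
3    0
4    1
dtype: int64

If you want the question's exact score1/score2 column names, flattening with m.columns = [f'{a}{b + 1}' for a, b in m.columns] should work (an assumed one-line tweak, not from the original answer).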
– anky
  • Thanks, it looks great. I've not used assign so I'll need to digest this! – dumbledad Nov 06 '19 at 17:06
  • @dumbledad it is similar to assigning a column `k`, except that it returns a copy and does not assign in place, so we can chain it with other functions :) – anky Nov 06 '19 at 17:07
  • I am torn about which of these to accept as the answer. I do find @ansev's [answer](https://stackoverflow.com/a/58735532/575530) more straightforward to read, but his takes 11:10 to run on the real data and yours takes 00:06, that's over 11 minutes faster! – dumbledad Nov 08 '19 at 11:44
  • @dumbledad that is because of `apply`, which is essentially a for loop and takes time. Good and easy to read for smaller datasets, but not the same for bigger ones. :) – anky Nov 08 '19 at 11:49
  • anky_91's answer is much better. I was really just experimenting here. That's why I started my answer with "we can also ...". In general the apply method is slow and it is best to avoid it. – ansev Nov 08 '19 at 11:49
  • One other unexpected difference between anky_91's vectorised solution and ansev's is that the faster vectorised one throws an "Unstacked DataFrame is too big, causing int32 overflow" ValueError for some of my data, while the looped (apply) version does not. I've added a try/except to swap when there's an overflow (sketched below). – dumbledad Nov 11 '19 at 13:57
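
A minimal sketch of that fallback, wrapping the fast unstack() path in a try/except and dropping to the apply() version on overflow (the function name widen_scores and the exact exception handling are assumptions, not code from the thread):

import pandas as pd

def widen_scores(df):
    try:
        # Fast vectorised path; can raise ValueError
        # ("Unstacked DataFrame is too big, causing int32 overflow")
        # on very large inputs.
        m = (df.assign(k=df.groupby(['user_id', 'week']).cumcount())
               .set_index(['user_id', 'week', 'k'])
               .unstack())
        m.columns = [f'{a}_{b}' for a, b in m.columns]
        return m.reset_index()
    except ValueError:
        # Slower apply-based fallback from the second answer.
        return (df.groupby(['user_id', 'week'])
                  .score
                  .apply(list)
                  .apply(pd.Series)
                  .add_prefix('score_')
                  .reset_index())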

We can also use groupby + apply(list) and apply(pd.Series):

new_df = (df.groupby(['user_id', 'week'])
            .score
            .apply(list)
            .apply(pd.Series)
            .add_prefix('score_')
            .reset_index())
print(new_df)

   user_id  week  score_0  score_1
0        1     1      1.0      NaN
1        1     2      2.0      3.0
2        2     1      1.0      1.0
– ansev
  • I really need a tattoo that reminds me that pandas `list` can be used as an aggregate function, it was key to [another answer](https://stackoverflow.com/a/57055945/575530) that really helped me. List, list, list, list, list, list, list, ... – dumbledad Nov 07 '19 at 10:49
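
For reference, a minimal demonstration of list as a groupby aggregation on the question's df (an untested sketch; the output is what pandas typically prints):

print(df.groupby(['user_id', 'week'])['score'].agg(list))

user_id  week
1        1          [1]
         2       [2, 3]
2        1       [1, 1]
Name: score, dtype: object

Each group's scores are collected into a Python list, which is exactly the intermediate that .apply(pd.Series) spreads into columns in the answer above.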