Pandas groupby with new column for each value

Question

I hope the title speaks for itself; I'd just like to add that it can be assumed that each key has the same amount of values. Online searching the title yielded the following solution:

Split pandas dataframe based on groupby

Which supposed to be solving my problem, although it does not. I'll give an example:

Input:

pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})

Output:

pd.DataFrame(data={'a':['foo','bar'],'b':[1,4],'c':[2,5],'d':[3,6]})

Intuitively, it would be a groupby function without an aggregation function, or an aggregation function that makes a list out of the keys.

Obviously, it can be done 'manually' using for loops etc., but using for loops with large data sets is very expensive computationally.

jezrael · Accepted Answer · 2019-04-07T12:25:50.927

2

Use GroupBy.cumcount for Series or column g, then reshape by DataFrame.set_index + Series.unstack or DataFrame.pivot, last data cleaning by DataFrame.add_prefix, DataFrame.rename_axis with DataFrame.reset_index:

g = df1.groupby('a').cumcount()
df = (df1.set_index(['a', g])['b']
         .unstack()
         .add_prefix('new_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df)
     a  new_0  new_1  new_2
0  bar      4      5      6
1  foo      1      2      3

Or:

df1['g'] = df1.groupby('a').cumcount()
df = df1.pivot('a','g','b').add_prefix('new_').reset_index().rename_axis(None, axis=1)
print (df)
     a  new_0  new_1  new_2
0  bar      4      5      6
1  foo      1      2      3

edited Apr 07 '19 at 12:25

answered Apr 07 '19 at 12:19

jezrael

822,522
95
1,334
1,252

Thanks a lot. This definitely Solved this, only it raised another issue trying to sort each row: "df_db[relevant_cols]=df_db[relevant_cols].apply(np.sort,axis=1)" Raised: ValueError: cannot label index with a null key I tried to reset index again, didn't work. Any ideas? – user9548409 Apr 07 '19 at 13:36
@user9548409 - One idea - is possible sorting before? Like `df = df.sort_values('b')` and then apply my solution? – jezrael Apr 07 '19 at 13:40
Sorry no, I'm going to have some null values after the groupby that I need to fill 0s with, and only then sort.. sorting before will mess up the order. If it matters, I used the first method(not the pivot function). Thanks again – user9548409 Apr 07 '19 at 13:47
@user9548409 - Ok, then use `df_db[relevant_cols] = np.sort(df_db[relevant_cols], axis=1)` – jezrael Apr 07 '19 at 13:52
@user9548409 - tested with `df_db = pd.DataFrame({ 'A':list('abcdef'), 'B':[4,5,4,5,5,4], 'C':[7,8,9,4,2,3], 'D':[1,3,5,7,1,0], 'E':[5,3,6,9,2,4], 'F':list('aaabbb') }) print (df_db) relevant_cols = ['B','C','E'] df_db[relevant_cols] = np.sort(df_db[relevant_cols], axis=1) print (df_db) ` – jezrael Apr 07 '19 at 13:52
Obviously, this one will work. It does not work after applying the groupby.. it messes up the indexes in some way. I tried to reset index yet still does not work – user9548409 Apr 07 '19 at 13:55
@user9548409 - If need working with data from question, use `df = df1.pivot('a','g','b')` and then `df = np.sort(df, axis=1)` and after sorting use `.reset_index().rename_axis(None, axis=1)`. reason is very easy - after reset_index get first column non numeric, and `np.sort` should be problem. so if first sort and then reset_index then working only with numeric columns (`new_0 new_1 new_2`) – jezrael Apr 07 '19 at 13:58
I really doubt it can't be tweaked into sorting after resetting index, but it really does work if you sort earlier. Thanks a lot you were right and it is solved! – user9548409 Apr 07 '19 at 15:25

score 1 · Answer 2 · answered Apr 07 '19 at 12:53

1

Here is an alternative approach, using groupby.apply and string.ascii_lowercase if column names are important:

from string import ascii_lowercase

df = pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})

# Groupby 'a'
g = df.groupby('a')['b'].apply(list)

# Construct new DataFrame from g
new_df = pd.DataFrame(g.values.tolist(), index=g.index).reset_index()

# Fix column names
new_df.columns = [x for x in ascii_lowercase[:new_df.shape[1]]]

print(new_df)

     a  b  c  d
0  bar  4  5  6
1  foo  1  2  3

answered Apr 07 '19 at 12:53

Chris Adams

18,389
4
22
39

Thanks, it also works. I used the other comment's method first, but after encountering sorting issues ("ValueError: cannot label index with a null key") I tried this one too and it raised the same error. Help will be appreciated solving this out as well. Googling didn't help so far. – user9548409 Apr 07 '19 at 13:52

Pandas groupby with new column for each value

2 Answers2