Groupby on a column and apply function on another column but keep first element of all other columns of dataframe

Question

This is my code:

frst_df = df.drop(columns=["Comment"]).groupby(['source'], as_index=False).agg('first')
cmnt_df = df.groupby(['source'], as_index=False)['Comment'].apply(', '.join)
merge_df = pd.merge(frst_df, cmnt_df , on='source')

I hope it is understandable what I'm trying to do here.

I have a large dataframe where I have a column 'source'. This is the primary column of the dataframe. Now for the column 'Comment', I want to join all comments corresponding to the value of the 'source'. There are approx 50 other columns in the dataframe. I want to pick only the first element from all the values corresponding to the 'source'.

The code I wrote works fine, but the dataframe is huge and it takes lots of time to create two separate dataframes and then merge them. Is there any better way to do this?

Welcome to Stackoverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Sep 18 '20 at 08:45

jezrael · Accepted Answer · 2020-09-18T09:17:59.647

You can use GroupBy.agg by dictionary - all columns are aggregate by first only Comment by join:

df = pd.DataFrame({
        'Comment':list('abcdef'),
         'B':[4,5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,5,7,1,0],
         'E':[5,3,6,9,2,4],
         'source':list('aaabbc')
})

d = dict.fromkeys(df.columns.difference(['source']), 'first')
d['Comment'] = ', '.join

merge_df = df.groupby('source', as_index=False).agg(d)
print (merge_df)
  source  B  C  Comment  D  E
0      a  4  7  a, b, c  1  5
1      b  5  4     d, e  7  9
2      c  4  3        f  0  4

score 1 · Answer 2 · answered Sep 18 '20 at 12:55

1

This is another possible solution.

df['Comment'] = df.groupby('source')['Comment'].transform(lambda x: ','.join(x))
df = df.groupby('source').first()

answered Sep 18 '20 at 12:55

rhug123

7,893
1
9
24

Groupby on a column and apply function on another column but keep first element of all other columns of dataframe

2 Answers2

Linked