3

This is my code:

frst_df = df.drop(columns=["Comment"]).groupby(['source'], as_index=False).agg('first')
cmnt_df = df.groupby(['source'], as_index=False)['Comment'].apply(', '.join)
merge_df = pd.merge(frst_df, cmnt_df , on='source')

I hope it is understandable what I'm trying to do here.

I have a large dataframe where I have a column 'source'. This is the primary column of the dataframe. Now for the column 'Comment', I want to join all comments corresponding to the value of the 'source'. There are approx 50 other columns in the dataframe. I want to pick only the first element from all the values corresponding to the 'source'.

The code I wrote works fine, but the dataframe is huge and it takes lots of time to create two separate dataframes and then merge them. Is there any better way to do this?

Shaido
  • 27,497
  • 23
  • 70
  • 73
Asif Iqbal
  • 421
  • 3
  • 13
  • Welcome to Stackoverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Sep 18 '20 at 08:45

2 Answers2

1

You can use GroupBy.agg by dictionary - all columns are aggregate by first only Comment by join:

df = pd.DataFrame({
        'Comment':list('abcdef'),
         'B':[4,5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,5,7,1,0],
         'E':[5,3,6,9,2,4],
         'source':list('aaabbc')
})

d = dict.fromkeys(df.columns.difference(['source']), 'first')
d['Comment'] = ', '.join

merge_df = df.groupby('source', as_index=False).agg(d)
print (merge_df)
  source  B  C  Comment  D  E
0      a  4  7  a, b, c  1  5
1      b  5  4     d, e  7  9
2      c  4  3        f  0  4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

This is another possible solution.

df['Comment'] = df.groupby('source')['Comment'].transform(lambda x: ','.join(x))
df = df.groupby('source').first()
rhug123
  • 7,893
  • 1
  • 9
  • 24