Use groupby in Pandas dataframe but it's creating multiple rows with same value

Question

I have a dataframe having 3 columns, let title, area, code. There are different values, including blanks in area and code columns for same title. I want to combine all these values of area and code separated by space such that there is one row for one title. I tried doing this using the code

df['area']=df['area'].replace("\\N",np.nan)
df['code']=df['code'].replace("\\N",np.nan)
df['area'] =df['area'].fillna('')
df['code'] =df['code'].fillna('')
df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))

I got what I wanted but this code is creating multiple rows having same data. For example if my dataset looks like this:

title   area    code
a       yv       f
a       ty       \N
a       dz       r
b       yv       \N
b       \N       f
c       dz       f

Then I want my output to be:

title   area          code
a       yv ty dz       f r
b       yv dz          f
c       dz             f

But I'm getting 3 duplicated rows for a, 2 for b and 1 for c containing same values. I can remove duplicate values later on, but my dataframe is very large (containing approx 2.4 million rows) and I'm not able to execute this command:

df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))

for it. So it would be helpful if I could remove duplicate values side by side, so that it doesn't take much RAM for processing. Or any other solution would be helpful too. Thanks in advance!

score 0 · Accepted Answer · answered Jan 30 '21 at 05:48

0

There are a couple of ways to do this. The easiest way (IMO) is:

df = df.groupby(['title']).agg(" ".join).reset_index()

You can see this similar question too.

answered Jan 30 '21 at 05:48

Qusai Alothman

1,982
9
23

Yes, thanks. This is the fastest way to do this. – Richa Goel Jan 31 '21 at 02:41

Use groupby in Pandas dataframe but it's creating multiple rows with same value

1 Answers1