1

I have a dataframe having 3 columns, let title, area, code. There are different values, including blanks in area and code columns for same title. I want to combine all these values of area and code separated by space such that there is one row for one title. I tried doing this using the code

df['area']=df['area'].replace("\\N",np.nan)
df['code']=df['code'].replace("\\N",np.nan)
df['area'] =df['area'].fillna('')
df['code'] =df['code'].fillna('')
df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))

I got what I wanted but this code is creating multiple rows having same data. For example if my dataset looks like this:

title   area    code
a       yv       f
a       ty       \N
a       dz       r
b       yv       \N
b       \N       f
c       dz       f

Then I want my output to be:

title   area          code
a       yv ty dz       f r
b       yv dz          f
c       dz             f

But I'm getting 3 duplicated rows for a, 2 for b and 1 for c containing same values. I can remove duplicate values later on, but my dataframe is very large (containing approx 2.4 million rows) and I'm not able to execute this command:

df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))

for it. So it would be helpful if I could remove duplicate values side by side, so that it doesn't take much RAM for processing. Or any other solution would be helpful too. Thanks in advance!

Qusai Alothman
  • 1,982
  • 9
  • 23
Richa Goel
  • 13
  • 2

1 Answers1

0

There are a couple of ways to do this. The easiest way (IMO) is:

df = df.groupby(['title']).agg(" ".join).reset_index()

You can see this similar question too.

Qusai Alothman
  • 1,982
  • 9
  • 23