I have a dataframe having 3 columns, let title, area, code. There are different values, including blanks in area and code columns for same title. I want to combine all these values of area and code separated by space such that there is one row for one title. I tried doing this using the code
df['area']=df['area'].replace("\\N",np.nan)
df['code']=df['code'].replace("\\N",np.nan)
df['area'] =df['area'].fillna('')
df['code'] =df['code'].fillna('')
df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))
I got what I wanted but this code is creating multiple rows having same data. For example if my dataset looks like this:
title area code
a yv f
a ty \N
a dz r
b yv \N
b \N f
c dz f
Then I want my output to be:
title area code
a yv ty dz f r
b yv dz f
c dz f
But I'm getting 3 duplicated rows for a, 2 for b and 1 for c containing same values. I can remove duplicate values later on, but my dataframe is very large (containing approx 2.4 million rows) and I'm not able to execute this command:
df[['area','code']] = df.groupby(['title'])[['area','code']].transform(lambda x: ' '.join(x))
for it. So it would be helpful if I could remove duplicate values side by side, so that it doesn't take much RAM for processing. Or any other solution would be helpful too. Thanks in advance!