
I have a dataset that contains rows with duplicates in one column, but different values in the other columns. I need to combine these duplicate rows whilst retaining the values from each row. See the example below, with duplicate values in the 'ID' column.

import pandas as pd

data = {'ID': ['01.560', '05.890', '05.890', '02.564'],
        'Foo': [1, 4, 7, 6],
        'Ba': ['cat', 'dog', 'rabbit', 'monkey'],
        'Bar': [5.76, 9.99, 1.20, 7.19]}
df = pd.DataFrame(data)


output:
       ID  Foo      Ba   Bar
0  01.560    1     cat  5.76
1  05.890    4     dog  9.99
2  05.890    7  rabbit  1.20
3  02.564    6  monkey  7.19

What I need is the rows with a duplicate ID moved to the right into new columns (probably with some suffix to ensure the column names are unique). The desired output is:

        Foo      Ba   Bar  Foo1     Ba1  Bar1
ID
01.560    1     cat  5.76   NaN     NaN   NaN
05.890    4     dog  9.99     7  rabbit  1.20
02.564    6  monkey  7.19   NaN     NaN   NaN

I have tried appending the duplicate rows to a dict, producing a new dataframe from that dict, and appending it to the original dataframe. However, this method is very slow, and I was wondering if there is a more straightforward way of achieving this.

def singl_line(ID, df):
    # collect every row with this ID into one flat dict,
    # numbering the keys so the column names stay unique
    results = dict()
    line_num = 1
    for i in range(len(df.loc[df['ID'] == ID])):
        fll = df.to_dict('records')[df.loc[df['ID'] == ID].index[i]]
        numbered = dict(("{} {}".format(k, line_num), v) for k, v in fll.items())
        line_num += 1
        results.update(numbered)
    return results

df_1ln = pd.DataFrame()

# build the list of unique IDs, preserving their original order
full = df['ID'].tolist()
nodup = []
for i in full:
    if i not in nodup:
        nodup.append(i)
nodup

# build a one-row DataFrame per unique ID and append it to the result
for i in nodup:
    row = singl_line(i, df)
    temp = pd.DataFrame([row], columns=row.keys())
    df_1ln = df_1ln.append(temp, sort=False)
df_1ln

output:
     ID 1  Foo 1    Ba 1  Bar 1    ID 2  Foo 2    Ba 2  Bar 2
0  01.560      1     cat   5.76     NaN    NaN     NaN    NaN
0  05.890      4     dog   9.99  05.890    7.0  rabbit    1.2
0  02.564      6  monkey   7.19     NaN    NaN     NaN    NaN

The dataset I am working with has ~4000 rows and 150 columns, with about 10 duplicates for each ID, so I am looking for a method that is faster than the one above.

3 Answers


A self-join would work well, as answered below and in this question: self-join with Pandas

df.join(df.drop(columns='ID'), on='ID', rsuffix='1')
jchua
  • I am getting the below error while trying to do a self join: "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat" – Shashank Oct 17 '21 at 11:07
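
The error in the comment above most likely occurs because on='ID' looks the string IDs up against the right-hand frame's default integer index. A minimal sketch of a variant that indexes the duplicate rows by 'ID' first (my own assumption, not part of the original answer; like the example, it covers one duplicate per ID):

# take only the duplicate rows, indexed by ID, so join can match on the 'ID' column
dups = df[df.duplicated('ID')].set_index('ID')

# keep the first occurrence of each ID and attach the duplicate's values with a suffix
df.drop_duplicates('ID').join(dups, on='ID', rsuffix='1')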

This solves your request; you will have to test it and see if it scales.

# rows whose ID has already appeared, suffixed and indexed by ID
M = df.loc[df.duplicated('ID')].add_suffix('_1').set_index('ID_1')

# first occurrence of each ID, indexed by ID
orig = df.drop_duplicates('ID').set_index('ID')

pd.concat([orig, M], axis=1)

        Foo      Ba   Bar  Foo_1    Ba_1  Bar_1
01.560    1     cat  5.76    NaN     NaN    NaN
05.890    4     dog  9.99    7.0  rabbit    1.2
02.564    6  monkey  7.19    NaN     NaN    NaN

Get the dataframe of duplicated values and concatenate it with the dataframe of duplicate-free values along the columns axis. Note that the index is set to 'ID'.
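
Since the question mentions roughly ten duplicates per ID, one possible extension of the same idea (my own sketch, not from the original answer) numbers every occurrence with groupby().cumcount() and concatenates one suffixed frame per occurrence:

# number each occurrence of an ID: 0 for the first row, 1 for the second, ...
numbered = df.assign(n=df.groupby('ID').cumcount())

# one frame per occurrence number, suffixed so the column names stay unique
pieces = [g.drop(columns='n').set_index('ID').add_suffix('_{}'.format(n) if n else '')
          for n, g in numbered.groupby('n')]

pd.concat(pieces, axis=1)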

sammywemmy

Another way using pivot (you will have to re-organize the columns though):

result = (df.assign(count=df.groupby("ID").cumcount())   # number each occurrence of an ID
            .pivot(index='ID', columns='count'))         # spread occurrences across column levels

# flatten the (name, count) MultiIndex columns into strings like 'Foo_0'
result.columns = ["_".join(str(x) for x in i) for i in result.columns]

print(result)

        Foo_0  Foo_1    Ba_0    Ba_1  Bar_0  Bar_1
ID                                                
01.560    1.0    NaN     cat     NaN   5.76    NaN
02.564    6.0    NaN  monkey     NaN   7.19    NaN
05.890    4.0    7.0     dog  rabbit   9.99    1.2
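
To re-organize the flattened columns so that everything belonging to one occurrence sits together (Foo_0, Ba_0, Bar_0, then Foo_1, Ba_1, Bar_1), one option, assuming the 'name_count' suffixes produced above:

# sort the columns by their numeric suffix; the sort is stable, so the original
# Foo/Ba/Bar order is kept within each occurrence
result = result[sorted(result.columns, key=lambda c: int(c.rsplit("_", 1)[1]))]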
Henry Yik