I have a dataset containing rows that are duplicated in one column but have different values in the other columns. I need to combine these duplicate rows whilst retaining the values from each row. See the example below, with duplicate values in the 'ID' column.
import pandas as pd

data = {'ID': ['01.560', '05.890', '05.890', '02.564'],
        'Foo': [1, 4, 7, 6],
        'Ba': ['cat', 'dog', 'rabbit', 'monkey'],
        'Bar': [5.76, 9.99, 1.20, 7.19]}
df = pd.DataFrame(data)
output:
ID Foo Ba Bar
0 01.560 1 cat 5.76
1 05.890 4 dog 9.99
2 05.890 7 rabbit 1.20
3 02.564 6 monkey 7.19
What I need is for the rows with a duplicate ID to be merged into one row, with the extra values moved to the right as new columns (probably with a numeric suffix to keep the column names unique). The desired output is:
Foo Ba Bar Foo1 Ba1 Bar1
ID
01.560 1 cat 5.76 NaN NaN NaN
05.890 4 dog 9.99 7 rabbit 1.20
02.564 6 monkey 7.19 NaN NaN NaN
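I suspect something along these lines, numbering each row within its ID group with groupby().cumcount() and then unstacking that number into the columns, might be close to what I want. This is only a sketch I put together from the docs (the rows come out sorted by ID rather than in the original order, and numeric columns that pick up NaN get cast to float):

# continuing from the df defined above
# number each row within its ID group: 0 for the first occurrence, 1 for the next, ...
df['dup'] = df.groupby('ID').cumcount()

# pivot the duplicate number out into the columns
wide = df.set_index(['ID', 'dup']).unstack('dup')

# regroup the columns by duplicate number rather than by field
wide = wide.sort_index(axis=1, level=1, sort_remaining=False)

# flatten ('Foo', 0) -> 'Foo', ('Foo', 1) -> 'Foo1', and so on
wide.columns = ['{}{}'.format(col, n) if n else col for col, n in wide.columns]

I am not sure whether this is the idiomatic way, or how well it scales.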
I have tried collecting each set of duplicate rows into a numbered dict, producing a one-row dataframe from that dict, and appending it to a new dataframe. However, this method is very slow, and I was wondering if there is a more straightforward way of achieving this.
def singl_line(ID, df):
    """Collect every row sharing this ID into one dict with numbered keys."""
    results = dict()
    # walk the matching rows, numbering them 1, 2, ...
    for line_num, idx in enumerate(df.loc[df['ID'] == ID].index, start=1):
        fll = df.to_dict('records')[idx]
        # suffix each key with the line number so the keys stay unique
        numbered = dict(("{} {}".format(k, line_num), v) for k, v in fll.items())
        results.update(numbered)
    return results
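One thing I did notice is that singl_line rebuilds df.to_dict('records') for every matching row and filters the whole frame once per ID. Hoisting both out gives the same dicts (an untested tweak of my own; records, groups and singl_line_fast are names I made up):

# compute the records list and the per-ID row positions once, up front
records = df.to_dict('records')
groups = df.groupby('ID').indices  # dict: ID -> array of integer row positions

def singl_line_fast(ID):
    # same numbered-dict output as singl_line, without re-scanning df
    # (positions match labels here because df has a default RangeIndex)
    results = {}
    for line_num, idx in enumerate(groups[ID], start=1):
        results.update({'{} {}'.format(k, line_num): v
                        for k, v in records[idx].items()})
    return results

That should cut out the repeated scans, but I assume there is a properly vectorised way to do the whole thing.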
df_1ln = pd.DataFrame()

# build a de-duplicated list of IDs, preserving first-seen order
full = df['ID'].tolist()
nodup = []
for i in full:
    if i not in nodup:
        nodup.append(i)

# collapse each ID's rows into a single line and stack them up
for i in nodup:
    single = singl_line(i, df)
    temp = pd.DataFrame([single], columns=single.keys())
    df_1ln = df_1ln.append(temp, sort=False)

df_1ln
output:
ID 1 Foo 1 Ba 1 Bar 1 ID 2 Foo 2 Ba 2 Bar 2
0 01.560 1 cat 5.76 NaN NaN NaN NaN
0 05.890 4 dog 9.99 05.890 7.0 rabbit 1.2
0 02.564 6 monkey 7.19 NaN NaN NaN NaN
The dataset I am working with has ~4000 rows and 150 columns, with about 10 duplicates for each ID, so I am looking for a method that is faster than the one above.
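In case it helps for comparing timings, the shape of the real data can be mocked up like this (the sizes come from the description above; the random values and the c0..c148 column names are invented):

import time
import numpy as np
import pandas as pd

# synthetic stand-in: ~4000 rows, 150 columns, 10 duplicates per ID
rng = np.random.default_rng(0)
n_ids, dups, n_value_cols = 400, 10, 149
big = pd.DataFrame(rng.random((n_ids * dups, n_value_cols)),
                   columns=['c{}'.format(i) for i in range(n_value_cols)])
big.insert(0, 'ID', np.repeat(['id{}'.format(i) for i in range(n_ids)], dups))

# time a candidate approach on the mock data
start = time.perf_counter()
big['dup'] = big.groupby('ID').cumcount()
wide = big.set_index(['ID', 'dup']).unstack('dup')
print('{:.3f}s for a {} x {} result'.format(time.perf_counter() - start, *wide.shape))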