
I have a dataframe with duplicate identifiers, but the data attributes differ between the duplicate rows. I want to remove the duplicate rows by combining their data into new columns.

Sample Data:

id type subtype value
111 a sub1 100
111 b sub2 200
112 c sub2 100
113 a sub3 100
114 b sub1 300
114 c sub1 100
import pandas as pd
data = {'id':['111', '111', '112', '113', '114','114'],
        'type':['a', 'b', 'c', 'a', 'b', 'c'],
        'subtype':['sub1', 'sub2', 'sub2', 'sub3', 'sub1', 'sub1'],
        'value':[100, 200, 100, 100, 300, 100]}
df = pd.DataFrame(data)
df

The desired output would look like this, where rows with duplicate identifiers are combined by adding new columns:

id type subtype value type1 subtype1 value1
111 a sub1 100 b sub2 200
112 c sub2 100 null null null
113 a sub3 100 null null null
114 b sub1 300 c sub1 100
import pandas as pd
output = {'id':['111', '112', '113', '114'],
        'type':['a', 'c', 'a', 'b'],
        'subtype':['sub1', 'sub2', 'sub3', 'sub1'],
        'value':[100, 100, 100, 300],
        'type1':['b', None, None, 'c'],
        'subtype1':['sub2', None, None, 'sub1'],
        'value1':[200, None, None, 100]}
df1 = pd.DataFrame(output)
df1

Note: in the real data there can be more than 2 rows per duplicate identifier.

Please help me out if you can, much appreciated!

Agnij

2 Answers


I think this code will do what you want, even if an id occurs more than twice.

import pandas as pd

df = pd.DataFrame(data)
df["ind"] = range(len(df))  # remember each row's original position

# Build a {(row, column): value} lookup for the original frame
data_dict = {}
for column in df.columns:
    for i, value in enumerate(df[column]):
        data_dict[(i, column)] = value

# For every repeated id, copy the duplicate row's attributes into new
# numbered columns on the first occurrence, then drop the duplicate row.
values_counts = {}  # id -> [duplicates seen so far, row index of first occurrence]
new_columns = {}    # (first occurrence row, new column name) -> value
for i, value in enumerate(df["id"]):
    if value in values_counts:
        count, first_row = values_counts[value]
        for column in df.columns:
            if column not in ("ind", "id"):
                new_columns[(first_row, column + str(count))] = data_dict[(i, column)]
        df = df.drop(i)
        values_counts[value][0] += 1
    else:
        values_counts[value] = [1, i]

for key, val in new_columns.items():
    df.loc[key[0], key[1]] = val
print(df)
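For reference, the same wide layout can be sketched much more compactly with `groupby.cumcount` and `unstack` (the `occurrence` and `wide` names are mine, not from the code above):

```python
import pandas as pd

data = {'id': ['111', '111', '112', '113', '114', '114'],
        'type': ['a', 'b', 'c', 'a', 'b', 'c'],
        'subtype': ['sub1', 'sub2', 'sub2', 'sub3', 'sub1', 'sub1'],
        'value': [100, 200, 100, 100, 300, 100]}
df = pd.DataFrame(data)

# Number each row within its id group: 0 for the first occurrence, 1 for the next, ...
occurrence = df.groupby('id').cumcount()

# Pivot the occurrence numbers into the column axis, then flatten the
# resulting (column, occurrence) MultiIndex into names like type1, value1, ...
wide = df.set_index(['id', occurrence]).unstack()
wide.columns = [f'{col}{n}' if n else col for col, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Ids with a single row get NaN in the numbered columns, which matches the nulls in the desired output.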

You can use pandas' groupby functionality:

df_ = df.groupby('id').agg({'type':lambda y:','.join(y).split(','),
                            'subtype':lambda z:','.join(z).split(','),
                            'value':lambda x:','.join(map(str, x)).split(',')}).reset_index()

    id    type       subtype       value
0  111  [a, b]  [sub1, sub2]  [100, 200]
1  112     [c]        [sub2]       [100]
2  113     [a]        [sub3]       [100]
3  114  [b, c]  [sub1, sub1]  [300, 100]
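As an aside, the join/split lambdas can be sketched more directly by aggregating each group into a list (note that this keeps `value` numeric, whereas join/split converts everything to strings):

```python
import pandas as pd

data = {'id': ['111', '111', '112', '113', '114', '114'],
        'type': ['a', 'b', 'c', 'a', 'b', 'c'],
        'subtype': ['sub1', 'sub2', 'sub2', 'sub3', 'sub1', 'sub1'],
        'value': [100, 200, 100, 100, 300, 100]}
df = pd.DataFrame(data)

# Collect each group's entries into plain lists, one list per column
df_ = df.groupby('id').agg(list).reset_index()
print(df_)
```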

Following this, you can separate the lists into multiple columns:

df_[['type_'+str(_) for _ in range(len(max(df_['type'].to_list(),key=len)))]] = pd.DataFrame(df_.type.to_list(),index = df_.index)

df_[['subtype_'+str(_) for _ in range(len(max(df_['subtype'].to_list(),key=len)))]] = pd.DataFrame(df_.subtype.to_list(),index = df_.index)

df_[['value_'+str(_) for _ in range(len(max(df_['value'].to_list(),key=len)))]] = pd.DataFrame(df_.value.to_list(),index = df_.index)

Output:

    id    type       subtype       value type_0 type_1 subtype_0 subtype_1 value_0 value_1
0  111  [a, b]  [sub1, sub2]  [100, 200]      a      b      sub1      sub2     100     200
1  112     [c]        [sub2]       [100]      c   None      sub2      None     100    None
2  113     [a]        [sub3]       [100]      a   None      sub3      None     100    None
3  114  [b, c]  [sub1, sub1]  [300, 100]      b      c      sub1      sub1     300     100
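The three repeated column-expansion statements can also be sketched as a single loop over the list columns (assuming the `agg(list)` variant here, so `value` stays numeric):

```python
import pandas as pd

data = {'id': ['111', '111', '112', '113', '114', '114'],
        'type': ['a', 'b', 'c', 'a', 'b', 'c'],
        'subtype': ['sub1', 'sub2', 'sub2', 'sub3', 'sub1', 'sub1'],
        'value': [100, 200, 100, 100, 300, 100]}
df = pd.DataFrame(data)
df_ = df.groupby('id').agg({'type': list, 'subtype': list,
                            'value': list}).reset_index()

# Expand each list column into numbered columns; the number of new columns
# per source column is the length of its longest list.
for col in ['type', 'subtype', 'value']:
    width = df_[col].str.len().max()
    names = [f'{col}_{i}' for i in range(width)]
    df_[names] = pd.DataFrame(df_[col].to_list(), index=df_.index)
print(df_)
```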

Further, there are somewhat similar problems out there that already have answers and can be experimented with - ref. stack_link1, stack_link2
