Pandas Merge Rows with Duplicate Ids Conditionally, Suitable for to CSV

Question

I have the following df and I want to merge the lines that have the same Ids, unless there are duplicates

Ids      A   B   C   D   E  F   G    H   I  J
4411    24  2   55  26  1                   
4411                        24  2   54  26  0
4412    22  4   54  26  0                   
4412                        18  8   54  26  0
7401    12  14  54  26  0                   
7401                        0   25  53  26  0
7402    24  2   54  26  0                   
7402                        25  1   54  26  0
10891   16  10  54  26  0                   
10891                       3   23  54  26  0
10891                       5   10  6   15  0

Example output

Ids      A   B   C   D   E  F   G    H   I  J
4411    24  2   55  26  1   24  2   54  26  0                               
4412    22  4   54  26  0   18  8   54  26  0                                   
7401    12  14  54  26  0   0   25  53  26  0                                   
7402    24  2   54  26  0   25  1   54  26  0                                   
10891   16  10  54  26  0   3   23  54  26  0               
10891                       5   10  6   15  0

I tried groupby but that throws errors when you write to csv.

How did you get that dataframe in the first place, may I ask? — Anton vBR, Feb 12 '18 at 23:52

cs95 · Accepted Answer · 2018-02-12T23:53:54.770

2

This solution uses Divakar's justify function. If needed, convert to numeric in advance:

df = df.apply(pd.to_numeric, errors='coerce', axis=1)

Now, call groupby + transform:

df.set_index('Ids')\
  .groupby(level=0)\
  .transform(
      justify, invalid_val=np.nan, axis=0, side='up'
  )\
  .dropna(how='all')

          A     B     C     D    E     F     G     H     I    J
Ids                                                            
4411   24.0   2.0  55.0  26.0  1.0  24.0   2.0  54.0  26.0  0.0
4412   22.0   4.0  54.0  26.0  0.0  18.0   8.0  54.0  26.0  0.0
7401   12.0  14.0  54.0  26.0  0.0   0.0  25.0  53.0  26.0  0.0
7402   24.0   2.0  54.0  26.0  0.0  25.0   1.0  54.0  26.0  0.0
10891  16.0  10.0  54.0  26.0  0.0   3.0  23.0  54.0  26.0  0.0
10891   NaN   NaN   NaN   NaN  NaN   5.0  10.0   6.0  15.0  0.0

edited Feb 12 '18 at 23:53

answered Feb 12 '18 at 23:45

cs95

379,657
97
704
746

1

Nice solution! +1 – Anton vBR Feb 12 '18 at 23:54
@Wen I remember, also that it performed much worse than Divakar's answer to the same question ;) Edit: Thanks Anton :) – cs95 Feb 12 '18 at 23:54
@cᴏʟᴅsᴘᴇᴇᴅ yep. :-) But I still like your answer , easy to explain .:-) – BENY Feb 13 '18 at 00:01

score 2 · Answer 2 · answered Feb 12 '18 at 23:52

This should be slow , but can achieve what you need

df.replace('',np.nan).groupby('Ids').apply(lambda x: pd.DataFrame(x).apply(lambda x: sorted(x, key=pd.isnull),0)).dropna(axis=0,thresh=2).fillna('')
Out[539]: 
     Ids     A     B     C     D    E     F     G     H     I    J
0   7402  24.0   2.0  54.0  26.0  0.0  25.0   1.0  54.0  26.0  0.0
2  10891  16.0  10.0  54.0  26.0  0.0   3.0  23.0  54.0  26.0  0.0
3  10891                                5.0  10.0   6.0  15.0  0.0

score 0 · Answer 3 · answered Feb 13 '18 at 00:46

0

Assuming all the blank values are nan, another option using groupby and dropna:

df.loc[:,'A':'E'] = df.groupby('Ids').apply(lambda x: x.loc[:,'A':'E'].ffill(limit=1))
df.dropna(subset=['F','G','H','I','J'])

answered Feb 13 '18 at 00:46

Allen Qin

19,507
8
51
67

Pandas Merge Rows with Duplicate Ids Conditionally, Suitable for to CSV

3 Answers3