1

I have the following df and I want to merge the lines that have the same Ids, unless there are duplicates

Ids      A   B   C   D   E  F   G    H   I  J
4411    24  2   55  26  1                   
4411                        24  2   54  26  0
4412    22  4   54  26  0                   
4412                        18  8   54  26  0
7401    12  14  54  26  0                   
7401                        0   25  53  26  0
7402    24  2   54  26  0                   
7402                        25  1   54  26  0
10891   16  10  54  26  0                   
10891                       3   23  54  26  0
10891                       5   10  6   15  0

Example output

Ids      A   B   C   D   E  F   G    H   I  J
4411    24  2   55  26  1   24  2   54  26  0                               
4412    22  4   54  26  0   18  8   54  26  0                                   
7401    12  14  54  26  0   0   25  53  26  0                                   
7402    24  2   54  26  0   25  1   54  26  0                                   
10891   16  10  54  26  0   3   23  54  26  0               
10891                       5   10  6   15  0

I tried groupby but that throws errors when you write to csv.

Heather
  • 129
  • 1
  • 11

3 Answers3

2

This solution uses Divakar's justify function. If needed, convert to numeric in advance:

df = df.apply(pd.to_numeric, errors='coerce', axis=1)

Now, call groupby + transform:

df.set_index('Ids')\
  .groupby(level=0)\
  .transform(
      justify, invalid_val=np.nan, axis=0, side='up'
  )\
  .dropna(how='all')

          A     B     C     D    E     F     G     H     I    J
Ids                                                            
4411   24.0   2.0  55.0  26.0  1.0  24.0   2.0  54.0  26.0  0.0
4412   22.0   4.0  54.0  26.0  0.0  18.0   8.0  54.0  26.0  0.0
7401   12.0  14.0  54.0  26.0  0.0   0.0  25.0  53.0  26.0  0.0
7402   24.0   2.0  54.0  26.0  0.0  25.0   1.0  54.0  26.0  0.0
10891  16.0  10.0  54.0  26.0  0.0   3.0  23.0  54.0  26.0  0.0
10891   NaN   NaN   NaN   NaN  NaN   5.0  10.0   6.0  15.0  0.0
cs95
  • 379,657
  • 97
  • 704
  • 746
2

This should be slow , but can achieve what you need

df.replace('',np.nan).groupby('Ids').apply(lambda x: pd.DataFrame(x).apply(lambda x: sorted(x, key=pd.isnull),0)).dropna(axis=0,thresh=2).fillna('')
Out[539]: 
     Ids     A     B     C     D    E     F     G     H     I    J
0   7402  24.0   2.0  54.0  26.0  0.0  25.0   1.0  54.0  26.0  0.0
2  10891  16.0  10.0  54.0  26.0  0.0   3.0  23.0  54.0  26.0  0.0
3  10891                                5.0  10.0   6.0  15.0  0.0
BENY
  • 317,841
  • 20
  • 164
  • 234
0

Assuming all the blank values are nan, another option using groupby and dropna:

df.loc[:,'A':'E'] = df.groupby('Ids').apply(lambda x: x.loc[:,'A':'E'].ffill(limit=1))
df.dropna(subset=['F','G','H','I','J'])
Allen Qin
  • 19,507
  • 8
  • 51
  • 67