I have a Pandas DataFrame that I need to:

  • group by the ID column (a regular column, not the index)
  • forward fill values to the right across multiple columns in each row, carrying the previous value only if it is not NaN (np.nan)

For each ID categorical value and each metric column (see the aX columns in the examples below) there is only one value; when there are multiple rows, the other entries are NaN (np.nan).

Take this as an example:

In [1]: import numpy as np                                                                                                           

In [2]: import pandas as pd                                                                                                          

In [3]: my_df = pd.DataFrame([ 
   ...:     {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0}, 
   ...:     {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0}, 
   ...: ])                                                                                                                           

In [4]: my_df.head(len(my_df))                                                                                                       
Out[4]: 
   id     a1  a2     a3    a4
0   1  100.0 NaN    NaN  90.0
1   1    NaN NaN   80.0   NaN
2  20    NaN NaN  100.0   NaN
3  20    NaN NaN    NaN  30.0

I have many more columns like a1 to a4.

I would like to:

  • treat np.nan as zero (0.0) when the same column has a number in a different row with the same ID, so the values can be summed together with groupby and subsequent aggregation functions
  • forward fill to the right within the same unique row (by ID), but only if there is a number somewhere in a column to the left

So basically in the example this means that:

  • for ID 1, "a2" becomes 100.0 (forward filled from "a1")
  • for ID 20, "a1" and "a2" both remain np.nan

See here:

In [5]: wanted_df = pd.DataFrame([ 
   ...:     {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0}, 
   ...: ])                                                                                                                           

In [6]: wanted_df.head(len(wanted_df))                                                                                               
Out[6]: 
   id     a1     a2     a3    a4
0   1  100.0  100.0   80.0  90.0
1  20    NaN    NaN  100.0  30.0

In [7]: 

The forward filling to the right should apply across multiple columns on the same row, not only to the closest column to the right.

When I use my_df.interpolate(method='pad', axis=1, limit=None, limit_direction='forward', limit_area=None, downcast=None) I still get multiple rows for the same ID.

When I use my_df.groupby("id").sum() I see 0.0 everywhere instead of retaining the NaN values in the scenarios defined above.

When I use my_df.groupby("id").apply(np.sum) the ID column is summed as well, so this is wrong; it should be retained.

How do I do this?

TPPZ

1 Answer

One idea is to use min_count=1 with sum:

df = my_df.groupby("id").sum(min_count=1)
print (df)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0
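
Given the follow-up requirement to also forward fill to the right within each row, one possible extension (a sketch, which matches the wanted output above) is to chain DataFrame.ffill with axis=1 after the sum; reset_index only restores id as a regular column:

df = (my_df.groupby("id")
           .sum(min_count=1)
           .ffill(axis=1)
           .reset_index())
print (df)
   id     a1     a2     a3    a4
0   1  100.0  100.0   80.0  90.0
1  20    NaN    NaN  100.0  30.0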

Or, if you need the first non-missing value, it is possible to use GroupBy.first:

df = my_df.groupby("id").first()
print (df)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0

It is more problematic if there are multiple non-missing values per group and all of them are needed:

# added 20 to a1
my_df = pd.DataFrame([ 
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0}, 
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan}, 
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan}, 
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0}, 
])
print (my_df)              
   id     a1  a2     a3    a4
0   1  100.0 NaN    NaN  90.0
1   1   20.0 NaN   80.0   NaN
2  20    NaN NaN  100.0   NaN
3  20    NaN NaN    NaN  30.0

def f(x):
    # per column: drop NaNs, then re-stack the remaining values from the top
    return x.apply(lambda col: pd.Series(col.dropna().to_numpy()))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)

   id     a1  a2     a3    a4
0   1  100.0 NaN   80.0  90.0
1   1   20.0 NaN    NaN   NaN
2  20    NaN NaN  100.0  30.0
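
To see what the helper does for a single column within one group, here is a minimal illustration on a toy Series (hypothetical values):

s = pd.Series([100.0, np.nan, 20.0])
print (pd.Series(s.dropna().to_numpy()))
0    100.0
1     20.0
dtype: float64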

The first and second solutions work differently here:

df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
       a1  a2     a3    a4
id                        
1   120.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0

df3 = my_df.groupby("id").first()
print (df3)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0
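
The difference arises because GroupBy.first takes the first non-missing value per column independently, so for id=1 it returns 100.0 from a1 and ignores 20.0; a minimal check on that one column:

print (my_df.groupby("id")["a1"].first())
id
1     100.0
20      NaN
Name: a1, dtype: float64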

If the values all have the same type, here numbers, it is also possible to use:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

f = lambda x: (pd.DataFrame(justify(x.to_numpy(), 
                                    invalid_val=np.nan, 
                                    axis=0, 
                                    side='up'), columns=my_df.columns.drop('id'))
                 .dropna(how='all'))
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
   id     a1  a2     a3    a4
0   1  100.0 NaN   80.0  90.0
1   1   20.0 NaN    NaN   NaN
2  20    NaN NaN  100.0  30.0
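
And to see justify in isolation, a minimal check on a small hypothetical array shows non-missing values pushed to the top of each column:

arr = np.array([[np.nan, 1.0],
                [2.0, np.nan]])
print (justify(arr, invalid_val=np.nan, axis=0, side='up'))
[[ 2.  1.]
 [nan nan]]
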
jezrael
  • Yes, that `.sum(min_count=1)` works quite well. However I need to forward fill to the right that `100.0` in `a1` for ID=1, how do I do that even when multiple columns to the right of `a1` are `np.nan`? Also: for each column `aX` there is only 1 value for a given ID - so the scenario where `a1` is both `100.0` and `20.0` for ID=1 is not the one I have. Thanks – TPPZ Dec 03 '20 at 15:02
  • @TPPZ Not sure if I understand, do you need `df = my_df.groupby("id").sum(min_count=1).ffill(axis=1)`? – jezrael Dec 03 '20 at 16:22