
I have a DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'id': '001', 'score': 0.1, 'mat': np.array([2, 2, 0, 2, 0, 0, 0, 0, 0])},
    {'id': '001', 'score': 0.3, 'mat': np.array([0, 3, 0, 0, 3, 0, 0, 0, 0])},
    {'id': '001', 'score': 0.5, 'mat': np.array([0, 1, 0, 0, 0, 0, 0, 0, 0])},
    {'id': '002', 'score': 0.1, 'mat': np.array([2, 0, 0, 0, 0, 0, 2, 0, 0])},
    {'id': '002', 'score': 0.2, 'mat': np.array([0, 0, 4, 0, 0, 0, 0, 4, 0])},
])

I want to combine the mat arrays that share an id, using the following rules:

  1. If only one array is non-zero at a given location, use that value in the new mat;
  2. If several arrays are non-zero there, choose the value from the row with the highest score.

Here's the result I expected:

id  mat
001 [2, 1, 0, 2, 3, 0, 0, 0, 0]
002 [2, 0, 4, 0, 0, 0, 2, 4, 0]
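To make the two rules concrete, here is a minimal sketch in plain NumPy for the rows of id 001 only. It assumes the rows are already ordered by ascending score; the names mats and combined are illustrative and not part of the original code.

```python
import numpy as np

# Rows of id '001', ordered by ascending score (0.1, 0.3, 0.5).
mats = np.array([
    [2, 2, 0, 2, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 3, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
])

combined = np.zeros_like(mats[0])
for row in mats:
    # A non-zero entry from a higher-scored row overwrites the current value;
    # zeros leave whatever was collected so far untouched.
    combined = np.where(row != 0, row, combined)

print(combined.tolist())  # [2, 1, 0, 2, 3, 0, 0, 0, 0]
```

Because later (higher-scored) rows win, a location with a single non-zero value keeps that value, and a contested location ends up with the highest-scored one.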

What I've tried:

df.drop(columns=['score']).groupby(['id']).sum()

This ignores score and adds up all mat arrays with the same id, which is correct only when at most one array is non-zero at each location.

df.drop(columns=['score']).groupby(['id']).agg('sum')

This yields the same wrong result ([2, 6, 0, 2, 3, 0, 0, 0, 0] for id 001). Is there any way to write a custom function for .agg() to achieve my goal?
Thanks for your help!

mjy

2 Answers


It got a little too long, but you can try:

>>> greatest_score_nonzero = lambda x: pd.DataFrame(
...     x.sort_values('score')['mat'].tolist()
... ).replace(0, method='ffill').iloc[-1].tolist()
>>> df.groupby('id').apply(greatest_score_nonzero).to_frame('mat')

                             mat
id                              
001  [2, 1, 0, 2, 3, 0, 0, 0, 0]
002  [2, 0, 4, 0, 0, 0, 2, 4, 0]

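greatest_score_nonzero sorts each group by score, stacks the mat arrays into a DataFrame (one row per array), forward-fills the 0s down each column, and takes the last row, so each position ends up with the non-zero value that has the highest score.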
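As far as I know, recent pandas releases deprecate the method argument of replace, so here is a hedged sketch of the same idea written with mask and ffill instead. The dict comprehension over the groups and the name result are my own choices for the illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'id': '001', 'score': 0.1, 'mat': np.array([2, 2, 0, 2, 0, 0, 0, 0, 0])},
    {'id': '001', 'score': 0.3, 'mat': np.array([0, 3, 0, 0, 3, 0, 0, 0, 0])},
    {'id': '001', 'score': 0.5, 'mat': np.array([0, 1, 0, 0, 0, 0, 0, 0, 0])},
    {'id': '002', 'score': 0.1, 'mat': np.array([2, 0, 0, 0, 0, 0, 2, 0, 0])},
    {'id': '002', 'score': 0.2, 'mat': np.array([0, 0, 4, 0, 0, 0, 0, 4, 0])},
])

def greatest_score_nonzero(group):
    # One row per original row (ascending score), one column per position.
    stacked = pd.DataFrame(group.sort_values('score')['mat'].tolist())
    # Hide the zeros as NaN, carry the last non-zero value down each column,
    # then turn positions that were zero in every row back into 0.
    filled = stacked.mask(stacked.eq(0)).ffill().fillna(0)
    return filled.iloc[-1].astype(int).tolist()

# Iterate over the groups explicitly instead of relying on groupby.apply.
result = {key: greatest_score_nonzero(group) for key, group in df.groupby('id')}
print(result)
```

The masking step converts the values to floats, which is why the last row is cast back to int; that cast assumes integer-valued mat arrays, as in the question.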

Sayandip Dutta
  • Thanks, this is concise and solves the example problem; upvoted! I chose jezrael's answer because it is more general and also applies to 2D arrays. My bad for not providing a 2D array example in the first place. – mjy Feb 02 '21 at 14:58
  • @mjy I concur with your choice, I would have done the same had I been in your place. – Sayandip Dutta Feb 02 '21 at 15:05

I use a custom function that shifts the non-zero values of a 2D array toward one edge (here the bottom row), after first sorting each group by score:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as empty (0 by default; np.nan is also supported)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting a boolean mask pushes the True (valid) entries to the end of the axis.
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

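# Stack each group's arrays into a 2D array, push the non-zero values down each
# column, and keep the bottom row (the highest-scored non-zero value per position).
f = lambda x: justify(np.array(x.tolist()), axis=0, side='down')[-1].tolist()
# Sort by score first so that higher-scored rows sit lower within each group.
df = df.sort_values(['id','score']).groupby(['id'])['mat'].agg(f).reset_index()
print(df)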
    id                          mat
0  001  [2, 1, 0, 2, 3, 0, 0, 0, 0]
1  002  [2, 0, 4, 0, 0, 0, 2, 4, 0]
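For mat values that are themselves 2D (the case raised in the comments), the following is a hedged, vectorized sketch of the same "last non-zero per position" idea. It is not the justify approach above: it stacks each group along a new first axis and picks values with np.argmax and np.take_along_axis. The helper name last_nonzero and the sample frame df2d are made up for the illustration.

```python
import numpy as np
import pandas as pd

def last_nonzero(stack):
    """Per position, return the last non-zero value along axis 0 (0 if none)."""
    mask = stack != 0
    n = stack.shape[0]
    # argmax on the reversed mask finds the first True from the end;
    # if a position has no True at all, argmax gives 0, so idx = n - 1,
    # and the value picked there is 0 anyway.
    idx = n - 1 - np.argmax(mask[::-1], axis=0)
    return np.take_along_axis(stack, idx[np.newaxis, ...], axis=0)[0]

# Hypothetical data with 2x2 arrays in the 'mat' column.
df2d = pd.DataFrame([
    {'id': 'a', 'score': 0.9, 'mat': np.array([[0, 7], [0, 0]])},
    {'id': 'a', 'score': 0.2, 'mat': np.array([[5, 6], [0, 8]])},
])

result = {
    # Sorting by ascending score puts the highest-scored array last in the stack.
    key: last_nonzero(np.stack(group.sort_values('score')['mat'].tolist())).tolist()
    for key, group in df2d.groupby('id')
}
print(result)  # {'a': [[5, 7], [0, 8]]}
```

This assumes every array within a group has the same shape, since np.stack requires it.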
jezrael