0

I have a data frame which look like this:

    ID     Time                Quantity   Type
    1     2019-04-03 05:51:54        0    Dog|Cat
    2     2019-05-03 05:51:54        0    Dog
    3     2019-04-02 05:51:54        0    Pig|Cat
    4     2019-04-03 05:51:54        2    
    4     2019-07-03 04:51:54        0    Dog|Cat|Pig
    3     2019-04-09 05:51:54        4    

There is also a dictonry to sort the type column:

    dictionary = {'Cat': 1,'Dog': 2,'Pig': 3}

I would like to group the table by ID and get the max for the other 3 columns. I have used this to group the first two:

    df.groupby(['ID']).agg({'Time':'max','Quantity':'max'})

What I'm having trouble with is integrating the additional code to get the max value from the dictionary into the aggregate function. I think it should be something like this:

    max(df['Type'].str.split(pat='|'), key=lambda s: dictionary[s])

My idea output would be:

    ID     Time                Quantity   Type
    1     2019-04-03 05:51:54        0    Dog
    2     2019-05-03 05:51:54        0    Dog
    3     2019-04-09 05:51:54        4    Pig
    4     2019-07-03 04:51:54        2    Pig

I'm sorry if I'm missing something elementary as I'm fairly new to using pandas

jol1234
  • 73
  • 1
  • 7

2 Answers2

2

Using a custom function for agg that calculates the max value for each ID and then does the reverse lookup:

d = {'Cat': 1, 'Dog': 2, 'Pig': 3}
rd = {v:k for k, v in d.items()} # reverse lookup dict

def f(z):
    return rd.get(max([d.get(y,-1)
                   for x in z.fillna('').str.split('|').values.tolist()
                   for y in x]), '')

df.groupby(['ID']).agg({'Time':'max', 'Quantity':'max', 'Type': f})

Output:

                   Time  Quantity Type
ID                                    
1   2019-04-03 05:51:54         0  Dog
2   2019-05-03 05:51:54         0  Dog
3   2019-04-09 05:51:54         4  Pig
4   2019-07-03 04:51:54         2  Pig
perl
  • 9,826
  • 1
  • 10
  • 22
0

This is more like unnest problem first, then become a category problem , after recognized that we just need groupby.agg to achieve the expected output

df.Type=df.Type.str.split('|')
s1=unnesting(df.dropna(),['Type'])
s1.Type=pd.Categorical(s1.Type,['Cat','Dog','Pig'],ordered=True)
s=s1.groupby(['ID']).Type.max()
o=df.groupby(['ID']).agg({'Time':'max','Quantity':'max'})
yourdf=pd.concat([s,o],axis=1)
yourdf
Out[371]: 
   Type                Time  Quantity
ID                                   
1   Dog 2019-04-03 05:51:54         0
2   Dog 2019-05-03 05:51:54         0
3   Pig 2019-04-09 05:51:54         4
4   Pig 2019-07-03 04:51:54         2
BENY
  • 317,841
  • 20
  • 164
  • 234