
I have a pandas DataFrame that looks like this:

           record_date           userid        id   priority
1   2016-05-27 02:00:39.600     1rhNGfQjU6  2718376     3
2   2016-05-27 02:00:39.600     EveMoYR1gs  2718377     3
3   2016-05-27 02:00:39.600     iVYGQgU3bX  2718378     3
4   2016-05-27 02:00:39.600     adA9fRNIgo  2718379     3
5   2016-05-27 02:00:39.600     rCDTlqTOXB  2718380     3
6   2016-05-27 02:00:39.600     aBI6JkLyal  2718381     3
7   2016-05-27 02:00:39.600     eiEct977ua  2718382     3
8   2016-05-27 02:00:39.600     7XVMWZPcZL  2718383     3
9   2016-05-27 02:00:39.600     GHajQM9UXN  2718384     3

It's not evident here, but there can be more than one record per user per day. I am trying to find a way to identify the id that corresponds to the lowest priority value per user per day. I think I am running into a tie-breaking problem: I tried the suggestions from another SO post (Python : Getting the Row which has the max value in groups using groupby), but that logic selects all records equal to the min, whereas I really need just one (in that case randomly chosen) record with the min priority per user. I know the logic I tried (sketched below) isn't getting that for me because

len(set(df[indices]['userid'])) == len(df[indices]['userid'])

is False. I understand why that approach doesn't work on its own (the mask is True for every record equal to the min). What's a good way to break the tie?
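
For context, here is roughly the logic I adapted from that post, written against my column names (treat it as a sketch rather than the exact code I ran):

import pandas as pd

df['record_date'] = pd.to_datetime(df['record_date'])
day = df['record_date'].dt.date
# Mark every row whose priority equals the per-user, per-day minimum.
indices = df['priority'] == df.groupby(['userid', day])['priority'].transform('min')

Because every row that ties for the minimum is marked, a user/day with several rows at the min priority produces more than one True, which is why the uniqueness check fails.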


1 Answer


You need to groupby by the userid column and the date part of record_date (dt.date), use idxmin to get the index of the row with the minimal priority in each group, and then select those rows with loc:

df['record_date'] = pd.to_datetime(df.record_date)

print (df.loc[df.priority.groupby([df.userid, df.record_date.dt.date]).idxmin()])

Sample:

import pandas as pd

df = pd.DataFrame({'record_date': {1: '2016-05-27 02:00:39.600', 2: '2016-05-27 02:00:39.600', 3: '2016-05-27 02:00:39.600', 4: '2016-05-27 02:00:39.600', 5: '2016-05-27 02:00:39.600', 6: '2016-05-27 02:00:39.600', 7: '2016-05-27 02:00:39.600', 8: '2016-05-27 02:00:39.600', 9: '2016-05-27 02:00:39.600'}, 'id': {1: 2718376, 2: 2718377, 3: 2718378, 4: 2718379, 5: 2718380, 6: 2718381, 7: 2718382, 8: 2718383, 9: 2718384}, 'priority': {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'userid': {1: '1rhNGfQjU6', 2: '1rhNGfQjU6', 3: '1rhNGfQjU6', 4: 'adA9fRNIgo', 5: 'adA9fRNIgo', 6: 'adA9fRNIgo', 7: 'eiEct977ua', 8: 'eiEct977ua', 9: 'eiEct977ua'}})
print (df)
        id  priority              record_date      userid
1  2718376         1  2016-05-27 02:00:39.600  1rhNGfQjU6
2  2718377         2  2016-05-27 02:00:39.600  1rhNGfQjU6
3  2718378         3  2016-05-27 02:00:39.600  1rhNGfQjU6
4  2718379         4  2016-05-27 02:00:39.600  adA9fRNIgo
5  2718380         5  2016-05-27 02:00:39.600  adA9fRNIgo
6  2718381         6  2016-05-27 02:00:39.600  adA9fRNIgo
7  2718382         7  2016-05-27 02:00:39.600  eiEct977ua
8  2718383         8  2016-05-27 02:00:39.600  eiEct977ua
9  2718384         9  2016-05-27 02:00:39.600  eiEct977ua

df['record_date'] = pd.to_datetime(df.record_date)
print (df.loc[df.priority.groupby([df.userid, df.record_date.dt.date]).idxmin()])
        id  priority             record_date      userid
1  2718376         1 2016-05-27 02:00:39.600  1rhNGfQjU6
4  2718379         4 2016-05-27 02:00:39.600  adA9fRNIgo
7  2718382         7 2016-05-27 02:00:39.600  eiEct977ua
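
idxmin returns the first index when several rows in a group share the minimal priority, so you always get exactly one row per user per day. If you prefer the tie to be broken randomly rather than by position (as the question mentions), one possibility (not part of the answer above, just a sketch) is to shuffle the rows first so that "first" becomes random:

shuffled = df.sample(frac=1, random_state=0)  # random_state only makes the shuffle reproducible
print (shuffled.loc[shuffled.priority.groupby([shuffled.userid, shuffled.record_date.dt.date]).idxmin()])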