
I have a big dataframe looking like this:

Id last_item_bought time
'user1' 'bike'  2018-01-01
'user3' 'spoon' 2018-01-01
'user2' 'car'   2018-01-01
'user1' 'spoon' 2018-01-02
'user2' 'bike'  2018-01-02
'user3' 'paper' 2018-01-03

Each user has either 0 or 1 row per day.

I want a DataFrame with one row per unique user, holding that user's latest last_item_bought entry:

Id last_item_bought time   
'user1' 'spoon' 
'user2' 'bike'  
'user3' 'paper'

The data is saved in a one-file-per-day fashion, which gives me two possible starting points:

  1. Load all the data into a dask DataFrame and then somehow filter out the rows of users that have newer entries.
  2. Loop over the days from newest to oldest, load each day into a pandas DataFrame, and add only those users to a new DataFrame that have no newer entry (i.e. are not already in the new DataFrame), as sketched below.

I'm looking for a solution with good performance. Each day can have several thousand rows, and I have to go back over several weeks.
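For option 2, this is roughly what I have in mind (just a sketch; the daily file names are placeholders):

import pandas as pd

# Walk the daily files from newest to oldest and keep only users
# that were not already seen on a newer day.
days = pd.date_range('2018-01-01', '2018-01-03')[::-1]          # newest day first
frames = []
seen = set()
for day in days:
    daily = pd.read_csv('items_{:%Y-%m-%d}.csv'.format(day))    # placeholder file name
    unseen = daily[~daily['Id'].isin(seen)]
    frames.append(unseen)
    seen.update(unseen['Id'])
latest = pd.concat(frames, ignore_index=True)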


1 Answer


I think you need sort_values + drop_duplicates:

df = df.sort_values(['Id','time']).drop_duplicates('Id', keep='last')
print (df)
        Id last_item_bought        time
3  'user1'          'spoon'  2018-01-02
4  'user2'           'bike'  2018-01-02
5  'user3'          'paper'  2018-01-03

If you need to filter the output columns:

df = df.sort_values(['Id','time']).drop_duplicates('Id', keep='last').drop('time', axis=1)
print (df)
        Id last_item_bought
3  'user1'          'spoon'
4  'user2'           'bike'
5  'user3'          'paper'

Dask solution (set_index is used for sorting):

import pandas as pd

df = pd.DataFrame({'Id': ['user1', 'user3', 'user2', 'user1', 'user2', 'user3'],
                   'time': ['2018-01-01', '2018-01-01', '2018-01-01', 
                            '2018-01-02', '2018-01-02', '2018-01-03'], 
                  'last_item_bought': ['bike', 'spoon', 'car', 'spoon', 'bike', 'paper']})
df['time'] = pd.to_datetime(df['time'])
print (df)
      Id last_item_bought       time
0  user1             bike 2018-01-01
1  user3            spoon 2018-01-01
2  user2              car 2018-01-01
3  user1            spoon 2018-01-02
4  user2             bike 2018-01-02
5  user3            paper 2018-01-03

from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)                   # convert to a Dask DataFrame with 3 partitions

ddf1 = (ddf.set_index('time')                             # sort globally by time
          .drop_duplicates(subset=['Id'], keep='last')    # keep the most recent row per Id
          .set_index('Id')                                # replace the time index by Id
          .reset_index()                                  # turn Id back into a column
          .compute())
print (ddf1)
      Id last_item_bought
0  user1            spoon
1  user2             bike
2  user3            paper
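
If your dask version does not accept the subset/keep arguments of drop_duplicates, a groupby-based variant of the same idea may work (a sketch, reusing ddf from above; it assumes groupby.last keeps the time order established by set_index when combining partitions):

ddf2 = (ddf.set_index('time')    # sort globally by time
           .groupby('Id')
           .last()               # most recent row per user
           .reset_index()
           .compute())
print (ddf2)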
  • Thanks, this seems like a really elegant solution. Unfortunately I didn't find sort_values in the dask dataframe API, but since that requires a full shuffle, it may be better to stick with the in-memory pandas solution and not over-complicate the problem. – dennis-w Feb 28 '18 at 10:06
  • @dennis-ec - You are welcome! I also added a dask solution. – jezrael Feb 28 '18 at 10:34
  • The dask solution is not working because dask does not support the keyword arguments in drop_duplicates. They are in the docs, but most of the dask docs are auto-generated from the pandas docs. – dennis-w Mar 01 '18 at 09:17