I have a big DataFrame that looks like this:
Id last_item_bought time
'user1' 'bike' 2018-01-01
'user3' 'spoon' 2018-01-01
'user2' 'car' 2018-01-01
'user1' 'spoon' 2018-01-02
'user2' 'bike' 2018-01-02
'user3' 'paper' 2018-01-03
Each user has either 0 or 1 row per day.
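For reference, a minimal snippet that builds this example input (values copied from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "Id": ["user1", "user3", "user2", "user1", "user2", "user3"],
    "last_item_bought": ["bike", "spoon", "car", "spoon", "bike", "paper"],
    "time": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-01",
        "2018-01-02", "2018-01-02", "2018-01-03",
    ]),
})
```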
I want a DataFrame with one row per unique user, holding that user's latest last_item_bought entry:
Id last_item_bought
'user1' 'spoon'
'user2' 'bike'
'user3' 'paper'
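On a single in-memory frame I know I can get this with a sort plus drop_duplicates, for instance:

```python
latest = (
    df.sort_values("time")
      .drop_duplicates("Id", keep="last")  # keep the newest row per user
      .drop(columns="time")
      .sort_values("Id")
      .reset_index(drop=True)
)
```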
The data, however, is saved in a file-per-day fashion, which brings me to two possible starting points (both sketched below):
- Load all of the data into a Dask DataFrame and then somehow filter out the rows of users that have a newer entry.
- Loop over the days from newest to oldest, load each day into a pandas DataFrame, and append only those users that are not already in the result (i.e., users with no newer entry).
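Rough sketches of what I mean by the two options (the data/ directory, the CSV format, and the read_csv arguments are assumptions about my setup, not tested code):

```python
import dask.dataframe as dd

# Option 1: read every daily file at once, then reduce to the newest row per user.
ddf = dd.read_csv("data/*.csv", parse_dates=["time"])

# Latest timestamp per user, then an inner join back to recover the item.
# Each user has at most one row per day, so (Id, time) uniquely identifies
# the winning row.
latest_time = ddf.groupby("Id")["time"].max().reset_index()
latest = dd.merge(ddf, latest_time, on=["Id", "time"])[["Id", "last_item_bought"]]

result = latest.compute()  # pandas DataFrame: one row per user
```

```python
import pandas as pd
from pathlib import Path

# Option 2: walk the days newest-first; the first time we see a user wins.
frames = []
seen = set()
for path in sorted(Path("data").glob("*.csv"), reverse=True):  # assumes filenames sort by date
    day = pd.read_csv(path)
    fresh = day[~day["Id"].isin(seen)]          # users with no newer entry so far
    frames.append(fresh[["Id", "last_item_bought"]])
    seen.update(fresh["Id"])

result = pd.concat(frames, ignore_index=True).sort_values("Id")
```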
I'm looking for a solution with good performance. Each day can have several thousand rows, and I have to cover several weeks of data.