I have been working through some pandas problems to get better at the fundamentals, and there is this one issue I consistently run into while trying to solve problems. Whenever I perform a groupby
operation on a DataFrame
followed by a series of other operations to get what I need, I have lost important information (other descriptive columns) about my output, and going back to retrieve the same is a new, typically more complicated problem to solve.
Take a look at this problem for instance:
Given a phone log table that has information about callers' call history, find out the callers whose first and last calls were to the same person on a given day. Output the caller ID, recipient ID, and the date called.
Input: caller_history
:
caller_id
:int64
recipient_id
:int64
date_called
:datetime64[ns]
Here is the code that I wrote for it:
import pandas as pd
caller_history['date_called'] = pd.to_datetime(caller_history['date_called'], format='%Y-%m-%d %H:%M:%S')
caller_history = caller_history.sort_values(by=['caller_id', 'date_called'])
grouping = caller_history.groupby(['caller_id', caller_history['date_called'].apply(lambda x: x.day)])
grouping.apply(lambda x: x['recipient_id'].iloc[0] == x['recipient_id'].iloc[-1]).reset_index()
My thought process was:
- Convert the
date_called
column to a format python can understand - Sort
caller_history
, first by thecaller_id
(bunching up together all the calls by a particular user) and then by thedate_called
- Create groups within
caller_history
, first bycaller_id
to isolate users and then by the day (which is what the call toapply
is doing). - Now finally check if, within the groups, the
recipient_id
matches on the first and last calls of every day.
This gives me 'fundamentally' the correct solution. But now, I have no way of retrieving the recipient_id
and date_called
. The call to reset_index()
is a last-ditch effort to recover whatever possible, in this case, I can recover the caller_id
and the day
portion of date_called
.