1

I have been working through some pandas problems to get better at the fundamentals, and there is this one issue I consistently run into while trying to solve problems. Whenever I perform a groupby operation on a DataFrame followed by a series of other operations to get what I need, I have lost important information (other descriptive columns) about my output, and going back to retrieve the same is a new, typically more complicated problem to solve.

Take a look at this problem for instance:

Given a phone log table that has information about callers' call history, find out the callers whose first and last calls were to the same person on a given day. Output the caller ID, recipient ID, and the date called.

Input: caller_history:

  • caller_id: int64
  • recipient_id: int64
  • date_called: datetime64[ns]

Here is the code that I wrote for it:

import pandas as pd

caller_history['date_called'] = pd.to_datetime(caller_history['date_called'], format='%Y-%m-%d %H:%M:%S')

caller_history = caller_history.sort_values(by=['caller_id', 'date_called'])

grouping = caller_history.groupby(['caller_id', caller_history['date_called'].apply(lambda x: x.day)])
grouping.apply(lambda x: x['recipient_id'].iloc[0] == x['recipient_id'].iloc[-1]).reset_index()

My thought process was:

  1. Convert the date_called column to a format python can understand
  2. Sort caller_history, first by the caller_id (bunching up together all the calls by a particular user) and then by the date_called
  3. Create groups within caller_history, first by caller_id to isolate users and then by the day (which is what the call to apply is doing).
  4. Now finally check if, within the groups, the recipient_id matches on the first and last calls of every day.

This gives me 'fundamentally' the correct solution. But now, I have no way of retrieving the recipient_id and date_called. The call to reset_index() is a last-ditch effort to recover whatever possible, in this case, I can recover the caller_id and the day portion of date_called.

wjandrea
  • 28,235
  • 9
  • 60
  • 81
Jgl
  • 13
  • 2
  • 1
    FWIW, if the input `date_called` is `datetime64[ns]`, you don't need to convert it with `to_datetime`, and you can use the `.dt` accessor instead of `.apply`: `caller_history['date_called'].dt.day`. – wjandrea May 27 '23 at 18:29
  • Please provide some example data, desired output, and actual output. See [How to make good reproducible pandas examples](/q/20109391/4518341). See also [mre]. – wjandrea May 27 '23 at 18:34
  • You need to save the .goupby result into a new dataframe and then merge the new dataframe with the old one – gtomer May 27 '23 at 19:17

2 Answers2

0

You can use pd.Grouper to group by date instead of converting to day, which will retain the whole date information:

caller_history.groupby(
    ["caller_id", pd.Grouper(key="date_called", freq="D")])\
    .agg(first=("recipient_id", "first"),
         match=("recipient_id", "last"))\
        .diff(axis=1)["match"].eq(0)\
            .reset_index(drop=False)

In each row:

  1. Groups by by "caller_id" and date
  2. Applies first and last aggregation, naming columns "first" and "match"
  3. takes the difference between the two, and checks if equal to 0
  4. Resets index so other columns as columns.

If you wanted to return a list of the recipient_id values for each date too, you could change the code to the following:

caller_history.groupby(
    ["caller_id", pd.Grouper(key="date_called", freq="D")])\
    .agg(first=("recipient_id", "first"),
         last=("recipient_id", "last"),
         ids=("recipient_id", list))\
        .assign(match=(lambda x: x["first"] - x["last"] == 0))\
            [["ids", "match"]].reset_index(drop=False)

Where .assign assigns a new column, in this case "match" as a Boolean of equal "last" and "first" columns, and "ids" has also been created as a list.

Rawson
  • 2,637
  • 1
  • 5
  • 14
  • This is _really_ interesting, I was not aware about `pd.Grouper`. Using this, I am able to retain the `date_called` column. Is there any way I can append/retain the `recipient_id` column while performing the grouping as well, or would I have to index into `caller_history` again to get those values back? – Jgl May 27 '23 at 19:58
  • So you don't want the data to be grouped? If you want to retain the `recipient_id`, you would either need to merge back to the original data, or convert the ids to a list for each date. Unless you want to return the last/first id? – Rawson May 27 '23 at 21:40
  • I have updated the solution to return a list of values. – Rawson May 27 '23 at 21:49
  • Great use of `Grouper`. However, OP wants the recipient of first and last call within each day but your solution returns first and last recipient_id within a day. – Parfait May 27 '23 at 23:29
0

Consider groupby().transform() to filter two sets of the call data by first and last calls of each day by caller. Then join to the sets and compare their respective recipient. (In SQL speak, this is similar to using window functions followed by self-joins).

# ADD HELPER COLUMNS
call_history = (
    call_history.assign(
        call_date = lambda df: df["date_called"].dt.normalize(),
        first_call = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("min"),
        last_call = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("max"),
        num_calls = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("count")
    )
)

# JOIN FIRST AND LAST CALL FILTERED DATASETS
call_compare_df = (
    call_history
        .query("date_called == first_date")
        .merge(
            call_history.query("date_called == last_date"),
            on = ["caller_id", "call_date", "num_calls"],
            suffixes = ["_first", "_last"]
        ).query("recipient_id_first == recipient_id_last")
        .query("num_calls > 1")
)
Parfait
  • 104,375
  • 17
  • 94
  • 125