
I have a dataset of assignments per case. I'm trying to fill the last unassigned from each case with the case_closed date.

case_nb   attorney      cas_closed_date assigned    last_event
1         A             2015-07-02      2015-07-02  2015-07-02
2         B             2015-09-10      2015-09-10  2015-09-10
3         C             2016-03-24      2016-03-24  2016-03-24
4         D             2018-06-07      2013-10-21  2014-02-06
4         C             2018-06-07      2013-09-13  2013-09-13
4         F             2018-06-07      2018-03-31  2018-04-05

(For replication:

df = pd.DataFrame({'assigned': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                3: '2013-10-21', 4: '2013-09-13', 5: '2018-03-31'},
                   'attorney': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'C', 5: 'F'},
                   'cas_closed_date': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                       3: '2018-06-07', 4: '2018-06-07', 5: '2018-06-07'},
                   'case_nb': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 4},
                   'last_event': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                  3: '2014-02-06', 4: '2013-09-13', 5: '2018-04-05'}})
for col in ['assigned', 'cas_closed_date', 'last_event']:
    df[col] = pd.to_datetime(df[col])

)

My dataset includes the case number, the attorney assigned, the case-closed date, the date the attorney was assigned, and the last event the attorney appeared on. For the above example, I'd want the last line to be

4         F             2018-06-07      2018-03-31  2018-06-07

I've seen a few ways of filling in NAs based on data in the same column, such as this question. But these ways use transform, which I can't get to use multiple columns.

What I have so far works with apply, but not transform:

def fixdate(gp):
    last_unasgn = gp.iloc[-1]['last_event']
    if gp.iloc[-1]['cas_closed_date'] > last_unasgn:
        return gp.iloc[-1]['cas_closed_date']
    else:
        return last_unasgn

df.groupby('case_nb').apply(fixdate)
> 4  2018-06-07

Is there a way to use transform? If not, what's the best way to use that aggregate data to fill in my original dataset?

RCA
  • Hi, please share your `df.to_dict()` so it will be easier to help you. It is not clear to me what you want to obtain: do you want to write the date the case is closed on last_event? Or do you want to do so just for the last attorney who received the case? – rpanai Jun 20 '18 at 20:41
    @user32185 `to_dict` posted. The final `last_event` for a group should be the `cas_closed_date` for that group (which is the second option you wrote, I believe). – RCA Jun 20 '18 at 20:54

3 Answers


The .transform operation passes each column of each group to the function one at a time, as a one-dimensional Series, while the .apply operation passes each group to the function all at once, as a DataFrame. So .transform works on one Series (column) at a time, and .apply works on an entire DataFrame (all columns). Hope this helps.

t0bi
  • Thanks! I knew vaguely that's what it was doing, but your answer helped me think through how to create a flag. – RCA Jun 20 '18 at 21:13

Provided the same case_nb is never assigned to two different attorneys on the same day, you can try this solution.

import numpy as np
import pandas as pd

df = pd.DataFrame({'assigned': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                3: '2013-10-21', 4: '2013-09-13', 5: '2018-03-31'},
                   'attorney': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'C', 5: 'F'},
                   'cas_closed_date': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                       3: '2018-06-07', 4: '2018-06-07', 5: '2018-06-07'},
                   'case_nb': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 4},
                   'last_event': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24',
                                  3: '2014-02-06', 4: '2013-09-13', 5: '2018-04-05'}})
for col in ['assigned', 'cas_closed_date', 'last_event']:
    df[col] = pd.to_datetime(df[col])

out = df.groupby("case_nb")["assigned"].max().reset_index()
out["last"] = True  # flag the most recent assignment in each case
df = pd.merge(df, out, how="left", on=["case_nb", "assigned"])
df["last_event"] = np.where(df["last"] == True, df["cas_closed_date"], df["last_event"])
del df["last"], out

Probably there is a more elegant solution, but at least this one is vectorized (see np.where) and doesn't use apply, which is better for pandas performance.

EDIT: In case you want to use transform, there is this option, which seems to me to be the fastest solution.

df["last_assigned"] = df.groupby("case_nb")["assigned"].transform("max")
df["last_event"] = np.where(df["assigned"]==df["last_assigned"], 
                            df["cas_closed_date"],
                            df["last_event"])
del df["last_assigned"]
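To sanity-check the transform version on the sample data from the question (rebuilt here with pd.to_datetime so all three date columns are real datetimes, which is an assumption on my part):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'case_nb': [1, 2, 3, 4, 4, 4],
    'attorney': ['A', 'B', 'C', 'D', 'C', 'F'],
    'cas_closed_date': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                       '2018-06-07', '2018-06-07', '2018-06-07']),
    'assigned': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                '2013-10-21', '2013-09-13', '2018-03-31']),
    'last_event': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                  '2014-02-06', '2013-09-13', '2018-04-05']),
})

# the latest assignment date within each case marks the last attorney
last_assigned = df.groupby('case_nb')['assigned'].transform('max')
df['last_event'] = np.where(df['assigned'] == last_assigned,
                            df['cas_closed_date'], df['last_event'])

# only attorney F's row (the last assignment on case 4) changes
print(df.loc[5, 'last_event'])  # 2018-06-07 00:00:00
```

Earlier assignments on case 4 (attorneys D and C) keep their original last_event values.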
rpanai
  • With `%%timeit`, your `np.where` solution took `1.45 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`, and my `loc` solution took `2.14 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`. So I would say `np.where` is the clear winner. – RCA Jun 21 '18 at 14:11
  • @RCA Did you try on your real case `df`? – rpanai Jun 21 '18 at 14:35

I ended up doing a combination of transform and just plain indexing.

df = df.sort_values(['case_nb', 'last_event'])
df['last_unassigned'] = df.groupby('case_nb')['last_event'].transform('last')
df.loc[(df['last_event'] == df['last_unassigned'])
       & (df['last_unassigned'] != df['cas_closed_date']), 'last_event'] = df['cas_closed_date']
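A self-contained run of this approach (sample data rebuilt with pd.to_datetime, which is an assumption) confirming it produces the row the question asks for:

```python
import pandas as pd

df = pd.DataFrame({
    'case_nb': [1, 2, 3, 4, 4, 4],
    'attorney': ['A', 'B', 'C', 'D', 'C', 'F'],
    'cas_closed_date': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                       '2018-06-07', '2018-06-07', '2018-06-07']),
    'assigned': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                '2013-10-21', '2013-09-13', '2018-03-31']),
    'last_event': pd.to_datetime(['2015-07-02', '2015-09-10', '2016-03-24',
                                  '2014-02-06', '2013-09-13', '2018-04-05']),
})

# sort so the chronologically last event sits at the bottom of each group
df = df.sort_values(['case_nb', 'last_event'])
df['last_unassigned'] = df.groupby('case_nb')['last_event'].transform('last')

# overwrite only the final row of each case whose last_event predates the close date
mask = (df['last_event'] == df['last_unassigned']) \
       & (df['last_unassigned'] != df['cas_closed_date'])
df.loc[mask, 'last_event'] = df['cas_closed_date']

# attorney F's row on case 4 now ends on the case-closed date
print(df.loc[5, 'last_event'])  # 2018-06-07 00:00:00
```

Cases 1-3, where last_event already equals cas_closed_date, are left untouched by the mask.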
RCA
  • Do you mind testing this solution vs my last one with your df, to see which one is faster on the real case? Also, I think this condition `df['last_unassigned'] != df['cas_closed_date']` is redundant: when those values are the same, it does no harm to update `df["last_event"]` with itself. – rpanai Jun 20 '18 at 22:13