Take min and max with null values - pandas groupby

Question

I have a pandas.DataFrame that looks like this:

| id |       start       |        end       |
|:--:|:-----------------:|:----------------:|
|  a |  1/1/20 12:00 AM  |  1/2/20 12:00 AM |
|  b |   1/1/20 6:37 PM  |  1/2/20 7:11 PM  |
|  b |   1/4/20 1:17 AM  |                  |
|  c |  2/4/20 12:00 AM  | 7/13/20 12:00 AM |
|  d |  4/19/20 8:45 PM  | 4/23/20 12:13 AM |
|  d | 11/21/20 12:00 AM |  3/2/21 12:00 AM |

For each id, I am trying to determine the min() of start and the max() of end. My issue is that end can be blank in some cases, and a blank should then count as the max (i.e., the issue is not closed).

Ideally, the result would look like this:

| id |      start      |        end       |
|:--:|:---------------:|:----------------:|
|  a | 1/1/20 12:00 AM |  1/2/20 12:00 AM |
|  b |  1/1/20 6:37 PM |                  |
|  c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
|  d | 4/19/20 8:45 PM |  3/2/21 12:00 AM |

I have looked at this question for inspiration but have not found a solution.

MRE below:

import pandas as pd, numpy as np

df = pd.DataFrame.from_dict({'id': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd'}, 'start': {0: '1/1/20 12:00 AM', 1: '1/1/20 6:37 PM', 2: '1/4/20 1:17 AM', 3: '2/4/20 12:00 AM', 4: '4/19/20 8:45 PM', 5: '11/21/20 12:00 AM'}, 'end': {0: '1/2/20 12:00 AM', 1: '1/2/20 7:11 PM', 2: np.nan, 3: '7/13/20 12:00 AM', 4: '4/23/20 12:13 AM', 5: '3/2/21 12:00 AM'}})

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

starts = df.groupby('id')['start'].min().reset_index()
ends = df.groupby('id')['end'].max().reset_index()

_df = pd.merge(starts, ends, on='id')

However, that produces:

| id |      start      |       end      |
|:--:|:---------------:|:--------------:|
|  a |  1/1/2020 0:00  |  1/2/2020 0:00 |
|  b |  1/1/2020 18:37 | 1/2/2020 19:11 |
|  c |  2/4/2020 0:00  | 7/13/2020 0:00 |
|  d | 4/19/2020 20:45 |  3/2/2021 0:00 |

How can I accomplish my desired result?

ansev · Accepted Answer · 2022-02-28T19:43:46.023

IIUC, you can use DataFrame.mask to set NaT in the aggregated result wherever a group has any missing value in that column:

new_df = \
df.groupby('id')\
  .agg({'start':'min', 'end':'max'})\
  .mask(df[['start', 'end']].isna()
                            .groupby(df['id'])
                            .max())\
  .reset_index()

print(new_df)
  id               start        end
0  a 2020-01-01 00:00:00 2020-01-02
1  b 2020-01-01 18:37:00        NaT
2  c 2020-02-04 00:00:00 2020-07-13
3  d 2020-04-19 20:45:00 2021-03-02

Detail:

print(df[['start', 'end']].isna()
                            .groupby(df['id'])
                            .max())

    start    end
id              
a   False  False
b   False   True
c   False  False
d   False  False
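For reference, the same idea can be written as a self-contained sketch. It assumes the data from the question (rebuilt here with ISO-formatted strings for brevity) and uses `.any()` instead of `.max()` on the boolean frame, which should be equivalent; the names `has_missing`, `agg` and `result` are only illustrative.

```python
import pandas as pd

# Data from the question, rebuilt with ISO strings (None becomes NaT).
df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 18:37", "2020-01-04 01:17",
        "2020-02-04 00:00", "2020-04-19 20:45", "2020-11-21 00:00",
    ]),
    "end": pd.to_datetime([
        "2020-01-02 00:00", "2020-01-02 19:11", None,
        "2020-07-13 00:00", "2020-04-23 00:13", "2021-03-02 00:00",
    ]),
})

# True where a group has at least one missing value in that column.
has_missing = df[["start", "end"]].isna().groupby(df["id"]).any()

# Ordinary aggregation: min and max skip NaT by default.
agg = df.groupby("id").agg({"start": "min", "end": "max"})

# Blank out every cell whose group contained a missing value.
result = agg.mask(has_missing).reset_index()
print(result)
```

Note that the mask covers both columns, so a group with a missing start would also get NaT as its start. If only end should behave this way, restricting the mask to that column is probably the safer choice.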

To group by multiple columns, for example id and status:

new_df = \
df.groupby(['id', 'status'])\
  .agg({'start':'min', 'end':'max'})\
  .mask(df[['start', 'end']].isna()
                            .groupby([df['id'], df['status']])
                            .max())\
  .reset_index()
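Since the question's data has no status column, here is a small sketch of the multi-key case with made-up rows. The status values and the `keys` list are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: "status" is not part of the original question.
df = pd.DataFrame({
    "id": ["a", "b", "b", "b"],
    "status": ["x", "x", "x", "y"],
    "start": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-04", "2020-01-05"]),
    "end": pd.to_datetime(["2020-01-02", "2020-01-02", None, "2020-01-06"]),
})

keys = ["id", "status"]

# The boolean frame no longer contains the key columns, so the keys
# are passed as Series taken from the original frame.
has_missing = df[["start", "end"]].isna().groupby([df[k] for k in keys]).any()

result = (
    df.groupby(keys)
      .agg({"start": "min", "end": "max"})
      .mask(has_missing)
      .reset_index()
)
print(result)
```

Only the (b, x) group contains a missing end, so only that row should end up with NaT.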

d.b · Answer 2 · 2022-02-28T19:04:01.487


Sort by end, which places NaT last by default, then use iloc to take the last end value of each group; start is simply the minimum:

df.sort_values("end").groupby("id").agg({"start": "min",
                                         "end": lambda x: x.iloc[-1]})
#                  start        end
# id                               
# a  2020-01-01 00:00:00 2020-01-02
# b  2020-01-01 18:37:00        NaT
# c  2020-02-04 00:00:00 2020-07-13
# d  2020-04-19 20:45:00 2021-03-02
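To avoid the Python-level lambda, one option is to fill missing end values with a far-future sentinel, take the built-in max, and turn the sentinel back into NaT. This is only a sketch: the sentinel date is an arbitrary assumption and must not collide with any real end value in the data.

```python
import pandas as pd

# Data from the question, rebuilt with ISO strings (None becomes NaT).
df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 18:37", "2020-01-04 01:17",
        "2020-02-04 00:00", "2020-04-19 20:45", "2020-11-21 00:00",
    ]),
    "end": pd.to_datetime([
        "2020-01-02 00:00", "2020-01-02 19:11", None,
        "2020-07-13 00:00", "2020-04-23 00:13", "2021-03-02 00:00",
    ]),
})

# Assumed sentinel: any date later than every real end value works.
sentinel = pd.Timestamp("2262-01-01")

out = (
    df.assign(end=df["end"].fillna(sentinel))  # open rows now sort as the max
      .groupby("id")
      .agg(start=("start", "min"), end=("end", "max"))
      .reset_index()
)

# Groups whose max is the sentinel had an open row: restore NaT.
out["end"] = out["end"].where(out["end"] != sentinel)
print(out)
```

Both aggregations here are built-in, so this should scale better than a per-group lambda, at the cost of relying on the sentinel assumption.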


I think the lambda here is slow, but it is an original approach. – ansev, Feb 28 '22 at 19:03
