I have a `pandas.DataFrame` that looks like this:

| id | start | end |
|:--:|:-----------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | 1/2/20 7:11 PM |
| b | 1/4/20 1:17 AM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 4/23/20 12:13 AM |
| d | 11/21/20 12:00 AM | 3/2/21 12:00 AM |

For each `id`, I am trying to determine the `min()` of `start` and the `max()` of `end`. My issue is that `end` can be blank in some cases, and when it is, the blank should count as the max (i.e., the issue is not closed).

Ideally, the result would look like this:
| id | start | end |
|:--:|:---------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 3/2/21 12:00 AM |
I have looked at this question for inspiration but have not found a solution.
MRE below:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'id': ['a', 'b', 'b', 'c', 'd', 'd'],
        'start': ['1/1/20 12:00 AM', '1/1/20 6:37 PM', '1/4/20 1:17 AM',
                  '2/4/20 12:00 AM', '4/19/20 8:45 PM', '11/21/20 12:00 AM'],
        'end': ['1/2/20 12:00 AM', '1/2/20 7:11 PM', np.nan,
                '7/13/20 12:00 AM', '4/23/20 12:13 AM', '3/2/21 12:00 AM'],
    })
    df['start'] = pd.to_datetime(df['start'])
    df['end'] = pd.to_datetime(df['end'])

    starts = df.groupby('id')['start'].min().reset_index()
    ends = df.groupby('id')['end'].max().reset_index()
    _df = pd.merge(starts, ends, on='id')

However, that produces:
| id | start | end |
|:--:|:---------------:|:--------------:|
| a | 1/1/2020 0:00 | 1/2/2020 0:00 |
| b | 1/1/2020 18:37 | 1/2/2020 19:11 |
| c | 2/4/2020 0:00 | 7/13/2020 0:00 |
| d | 4/19/2020 20:45 | 3/2/2021 0:00 |
How can I accomplish my desired result?
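
For reference, `max()` skips missing values by default, which is why the open-ended row for `b` is ignored above. One possible approach, offered as a sketch rather than a definitive answer, is to take the usual per-group min/max and then blank out `end` for any `id` that has at least one missing `end`. It assumes a single blank `end` should make the whole group's `end` blank. The names `has_open` and `result` are illustrative and not part of the code above.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c', 'd', 'd'],
    'start': ['1/1/20 12:00 AM', '1/1/20 6:37 PM', '1/4/20 1:17 AM',
              '2/4/20 12:00 AM', '4/19/20 8:45 PM', '11/21/20 12:00 AM'],
    'end': ['1/2/20 12:00 AM', '1/2/20 7:11 PM', np.nan,
            '7/13/20 12:00 AM', '4/23/20 12:13 AM', '3/2/21 12:00 AM'],
})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# Per-id earliest start and latest end (missing ends are skipped here).
result = df.groupby('id').agg(start=('start', 'min'), end=('end', 'max'))

# True for every id that has at least one missing end (assumed to mean "still open").
has_open = df['end'].isna().groupby(df['id']).any()

# Blank out the end for those ids; both objects are indexed by id, so they align.
result['end'] = result['end'].mask(has_open)
result = result.reset_index()

print(result)
```

With the sample data, `b` ends up with `NaT` for `end`, while the other ids keep their latest `end`.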