import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which yields
first last count type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 Received
4 2018-01-06 08:00:00 2018-01-06 08:00:00 1 Missed
5 2018-01-06 09:00:00 2018-01-06 10:00:00 2 Received
6 2018-01-06 11:00:00 2018-01-06 13:00:00 3 Missed
7 2018-01-06 14:00:00 2018-01-06 15:00:00 2 Received
You could use asfreq to expand the DataFrame to include missing rows:
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
# reading_time id type
# 0 2018-01-06 00:00:00 1.0 Interval
# 1 2018-01-06 01:00:00 1.0 Interval
# 2 2018-01-06 02:00:00 1.0 Interval
# 3 2018-01-06 03:00:00 1.0 Interval
# 4 2018-01-06 04:00:00 NaN NaN
# 5 2018-01-06 05:00:00 NaN NaN
# 6 2018-01-06 06:00:00 1.0 Interval
# 7 2018-01-06 07:00:00 1.0 Interval
# 8 2018-01-06 08:00:00 NaN NaN
# 9 2018-01-06 09:00:00 1.0 Interval
# 10 2018-01-06 10:00:00 1.0 Interval
# 11 2018-01-06 11:00:00 NaN NaN
# 12 2018-01-06 12:00:00 NaN NaN
# 13 2018-01-06 13:00:00 NaN NaN
# 14 2018-01-06 14:00:00 1.0 Interval
# 15 2018-01-06 15:00:00 1.0 Interval
Next, use the NaNs in, say, the id column to identify groups:
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
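To see why this works, here is a small standalone sketch (the toy Series and the intermediate names are only illustrative, not part of the solution above): the mask marks missing rows, diff flags every switch between present and missing, and cumsum turns each switch into a new group label:
import pandas as pd
ids = pd.Series([1.0, 1.0, None, None, 1.0])
mask = pd.isnull(ids).astype(int)   # 0 where a reading exists, 1 where it is missing
change = mask.diff() != 0           # True whenever the present/missing state flips
groups = change.cumsum()            # running count of flips -> one label per run
print(pd.DataFrame({'id': ids, 'mask': mask, 'change': change, 'group': groups}))
#     id  mask  change  group
# 0  1.0     0    True      1
# 1  1.0     0   False      1
# 2  NaN     1    True      2
# 3  NaN     1   False      2
# 4  1.0     0    True      3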
Then group by the group values to find the first and last reading_time for each group:
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
# first last count
# group
# 1 2018-01-06 00:00:00 2018-01-06 03:00:00 4
# 2 2018-01-06 04:00:00 2018-01-06 05:00:00 2
# 3 2018-01-06 06:00:00 2018-01-06 07:00:00 2
# 4 2018-01-06 08:00:00 2018-01-06 08:00:00 1
# 5 2018-01-06 09:00:00 2018-01-06 10:00:00 2
# 6 2018-01-06 11:00:00 2018-01-06 13:00:00 3
# 7 2018-01-06 14:00:00 2018-01-06 15:00:00 2
Since the Missed and Received values alternate, they can be generated from the index:
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
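Note that this relies on the first row of the expanded frame being a received reading (otherwise the labels would be swapped). An alternative sketch that does not rely on the alternation at all (not part of the original approach) is to label each group by whether it contains any non-NaN id:
import numpy as np
result['type'] = np.where(df.groupby('group')['id'].count() == 0, 'Missed', 'Received')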
To handle multiple frequencies on a per-id basis, you could use:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.sort_values(by='reading_time')
df = df.set_index('reading_time')
# freqmap gives the expected reading frequency for each id
# (id 1 reports hourly, id 2 every 15 minutes)
freqmap = {1:'1H', 2:'15T'}
# expand each id's rows to that id's own frequency; missing timestamps become NaN rows
df = df.groupby('id', group_keys=False).apply(
    lambda grp: grp.asfreq(freqmap[grp['id'].iloc[0]]))
df = df.reset_index(level='reading_time')
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
grouped = df.groupby('group')
result = grouped['reading_time'].agg(['first','last','count'])
result['id'] = grouped['id'].agg('first')
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which yields
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1.0 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 NaN Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1.0 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 NaN Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2.0 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 NaN Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2.0 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 NaN Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2.0 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 NaN Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2.0 Received
It seems plausible that "Missed" rows should not be associated with any id, but to bring the result a little closer to the one you posted, you could use ffill to forward-fill the NaN id values:
result['id'] = result['id'].ffill()
which changes the result to
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 1 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 1 Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 2 Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 2 Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 2 Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2 Received
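If only the gaps themselves are of interest, a final filter on result (just a usage sketch) keeps the missed spans:
missed = result[result['type'] == 'Missed']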