I have a data frame with a column of datetimes at minute resolution (generally in hour increments), for example 2018-01-14 03:00, 2018-01-14 04:00, etc.

What I want to do is count the consecutive records at a minute increment that I define (some could be 60 minutes, others 15, etc.), and then associate the first and last reading time with each block.

Take the following data for instance:

id             reading_time     type
1              1/6/2018 00:00   Interval
1              1/6/2018 01:00   Interval
1              1/6/2018 02:00   Interval
1              1/6/2018 03:00   Interval
1              1/6/2018 06:00   Interval
1              1/6/2018 07:00   Interval
1              1/6/2018 09:00   Interval
1              1/6/2018 10:00   Interval
1              1/6/2018 14:00   Interval
1              1/6/2018 15:00   Interval

I would like the output to look like the following:

id  first_reading_time  last_reading_time   number_of_records   type
1   1/6/2018 00:00      1/6/2018 03:00      4                   Received
1   1/6/2018 04:00      1/6/2018 05:00      2                   Missed
1   1/6/2018 06:00      1/6/2018 07:00      2                   Received
1   1/6/2018 08:00      1/6/2018 08:00      1                   Missed
1   1/6/2018 09:00      1/6/2018 10:00      2                   Received
1   1/6/2018 11:00      1/6/2018 13:00      3                   Missed
1   1/6/2018 14:00      1/6/2018 15:00      2                   Received

Now, this example covers only one day, and I can write the code for one day. In my actual data, many of the records extend across multiple days.

What I've been able to do is capture this aggregation up to the end of the first consecutive block, but not the following ones, using this code:

import pandas as pd

df = pd.DataFrame(data=d)   # d holds the raw data shown up top
df.reading_time = pd.to_datetime(df.reading_time)
df = df.sort_values('reading_time', ascending=True)

delta = pd.Timedelta(60, 'm')
consecutive = df.reading_time.diff().fillna(pd.Timedelta(0)).abs().le(delta)
df['consecutive'] = consecutive

idx_loc = df.index.get_loc(consecutive.idxmin())
first_reading_time = df['reading_time'][0]
last_reading_time = df['reading_time'][idx_loc - 1]
df.iloc[:idx_loc]

where 'd' holds the more granular data shown up top. The line that sets the variable 'consecutive' tags each record as True or False based on the number of minutes between the current row and the previous one. The variable idx_loc captures the number of rows that were consecutive, but it only captures the first block (in this case 1/6/2018 00:00 through 1/6/2018 03:00).
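
For reference, extending that True/False tagging to label every consecutive block rather than only the first can be sketched with a cumulative sum over the breaks (a minimal sketch building on the variables above; it still only summarizes the blocks that are present and does not produce the 'Missed' rows):

# each False in `consecutive` starts a new block, so cumulative-summing
# the breaks yields a per-row block label
df['block'] = (~consecutive).cumsum()
received = df.groupby('block')['reading_time'].agg(['first', 'last', 'count'])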

Any help is appreciated.

pol_guy
  • Hope the link will help https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range/46526249#46526249 – BENY Jan 14 '18 at 17:27
  • Does the `id` value play a role in how the rows are grouped? How was it determined that the rows labelled "Missed" have `id` value of 1? (If those rows were missed, couldn't they belong to any `id` or no `id`...?) – unutbu Jan 14 '18 at 17:59

1 Answer

import pandas as pd

df = pd.DataFrame({
    'id': [1]*10,
    'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00',
                     '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00',
                     '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00',
                     '1/6/2018 15:00'],
    'type': ['Interval']*10})
df['reading_time'] = pd.to_datetime(df['reading_time'])

# expand to a regular hourly grid; missing readings become NaN rows
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()

# a new group starts wherever the rows flip between present and missing
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()

result = df.groupby('group')['reading_time'].agg(['first', 'last', 'count'])
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]

yields

                    first                last  count      type
group                                                         
1     2018-01-06 00:00:00 2018-01-06 03:00:00      4  Received
2     2018-01-06 04:00:00 2018-01-06 05:00:00      2    Missed
3     2018-01-06 06:00:00 2018-01-06 07:00:00      2  Received
4     2018-01-06 08:00:00 2018-01-06 08:00:00      1    Missed
5     2018-01-06 09:00:00 2018-01-06 10:00:00      2  Received
6     2018-01-06 11:00:00 2018-01-06 13:00:00      3    Missed
7     2018-01-06 14:00:00 2018-01-06 15:00:00      2  Received

You could use asfreq to expand the DataFrame to include missing rows:

df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()

#           reading_time   id      type
# 0  2018-01-06 00:00:00  1.0  Interval
# 1  2018-01-06 01:00:00  1.0  Interval
# 2  2018-01-06 02:00:00  1.0  Interval
# 3  2018-01-06 03:00:00  1.0  Interval
# 4  2018-01-06 04:00:00  NaN       NaN
# 5  2018-01-06 05:00:00  NaN       NaN
# 6  2018-01-06 06:00:00  1.0  Interval
# 7  2018-01-06 07:00:00  1.0  Interval
# 8  2018-01-06 08:00:00  NaN       NaN
# 9  2018-01-06 09:00:00  1.0  Interval
# 10 2018-01-06 10:00:00  1.0  Interval
# 11 2018-01-06 11:00:00  NaN       NaN
# 12 2018-01-06 12:00:00  NaN       NaN
# 13 2018-01-06 13:00:00  NaN       NaN
# 14 2018-01-06 14:00:00  1.0  Interval
# 15 2018-01-06 15:00:00  1.0  Interval

Next, use the NaNs in, say, the id column to identify groups:

df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()

Then group by the group values to find the first and last reading_time for each group:

result = df.groupby('group')['reading_time'].agg(['first','last','count'])

#                     first                last  count
# group                                               
# 1     2018-01-06 00:00:00 2018-01-06 03:00:00      4
# 2     2018-01-06 04:00:00 2018-01-06 05:00:00      2
# 3     2018-01-06 06:00:00 2018-01-06 07:00:00      2
# 4     2018-01-06 08:00:00 2018-01-06 08:00:00      1
# 5     2018-01-06 09:00:00 2018-01-06 10:00:00      2
# 6     2018-01-06 11:00:00 2018-01-06 13:00:00      3
# 7     2018-01-06 14:00:00 2018-01-06 15:00:00      2

Since the Missed and Received values alternate, they can be generated from the index:

types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]

To handle multiple frequencies on a per-id basis, you could use:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00',
                     '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00',
                     '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00',
                     '1/6/2018 15:00'],
    'type': ['Interval']*10})

df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.sort_values(by='reading_time')
df = df.set_index('reading_time')
# expand each id at its own frequency
freqmap = {1: '1H', 2: '15T'}
df = df.groupby('id', group_keys=False).apply(
    lambda grp: grp.asfreq(freqmap[grp['id'].iloc[0]]))
df = df.reset_index(level='reading_time')

df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
grouped = df.groupby('group')
result = grouped['reading_time'].agg(['first','last','count'])
result['id'] = grouped['id'].agg('first')
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]

which yields

                    first                last  count   id      type
group                                                              
1     2018-01-06 00:00:00 2018-01-06 03:00:00      4  1.0  Received
2     2018-01-06 04:00:00 2018-01-06 05:00:00      2  NaN    Missed
3     2018-01-06 06:00:00 2018-01-06 07:00:00      2  1.0  Received
4     2018-01-06 07:15:00 2018-01-06 08:45:00      7  NaN    Missed
5     2018-01-06 09:00:00 2018-01-06 09:00:00      1  2.0  Received
6     2018-01-06 09:15:00 2018-01-06 09:45:00      3  NaN    Missed
7     2018-01-06 10:00:00 2018-01-06 10:00:00      1  2.0  Received
8     2018-01-06 10:15:00 2018-01-06 13:45:00     15  NaN    Missed
9     2018-01-06 14:00:00 2018-01-06 14:00:00      1  2.0  Received
10    2018-01-06 14:15:00 2018-01-06 14:45:00      3  NaN    Missed
11    2018-01-06 15:00:00 2018-01-06 15:00:00      1  2.0  Received

It seems plausible that "Missed" rows should not be associated with any id, but to bring the result a little closer to the one you posted, you could ffill to forward-fill NaN id values:

result['id'] = result['id'].ffill()

changes the result to

                    first                last  count  id      type
group                                                             
1     2018-01-06 00:00:00 2018-01-06 03:00:00      4   1  Received
2     2018-01-06 04:00:00 2018-01-06 05:00:00      2   1    Missed
3     2018-01-06 06:00:00 2018-01-06 07:00:00      2   1  Received
4     2018-01-06 07:15:00 2018-01-06 08:45:00      7   1    Missed
5     2018-01-06 09:00:00 2018-01-06 09:00:00      1   2  Received
6     2018-01-06 09:15:00 2018-01-06 09:45:00      3   2    Missed
7     2018-01-06 10:00:00 2018-01-06 10:00:00      1   2  Received
8     2018-01-06 10:15:00 2018-01-06 13:45:00     15   2    Missed
9     2018-01-06 14:00:00 2018-01-06 14:00:00      1   2  Received
10    2018-01-06 14:15:00 2018-01-06 14:45:00      3   2    Missed
11    2018-01-06 15:00:00 2018-01-06 15:00:00      1   2  Received
unutbu
  • Wow...this is great! Is it possible to turn the .asfreq('1H') function into variable minutes? For example, one ID may have readings that come in at 60-minute intervals, others at 15-minute intervals, and yet others at 5-minute intervals. – pol_guy Jan 14 '18 at 21:28
  • I modified the post above to show how you could handle multiple frequencies on a per-id basis. – unutbu Jan 14 '18 at 22:01
  • One last question, what if I wanted to take this to missing first or last hours of the day? For example, say we were missing the first 2 hours: 2018-01-06 00:00 and 2018-01-06 01:00 and/or the last 2 which would be 2018-01-06 22:00 and 2018-01-06 23:00. How could we catch those reading as missed? – pol_guy Jan 14 '18 at 22:33
  • I think the easiest way to handle that would be to add a new starting and/or ending row to `df`, proceed as above, and then change the last line to `result['type'] = types[result.index % 2 + 1]`. – unutbu Jan 14 '18 at 23:30
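
A rough sketch of that last suggestion for a single id at hourly frequency (the date_range bounds below are assumptions for this example, and the label parity is handled with a computed offset rather than a fixed shift, since `types` has only two entries):

import pandas as pd

# pad to the full day so leading/trailing gaps become NaN rows
# (the day's bounds are assumed here)
full_day = pd.date_range('2018-01-06 00:00', '2018-01-06 23:00', freq='1H')
df = df.set_index('reading_time').reindex(full_day)
df.index.name = 'reading_time'
df = df.reset_index()

# same grouping as above
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
result = df.groupby('group')['reading_time'].agg(['first', 'last', 'count'])

# group 1 is 'Missed' when the day starts with a gap, 'Received' otherwise
types = pd.Categorical(['Missed', 'Received'])
offset = 1 if pd.isnull(df['id'].iloc[0]) else 0
result['type'] = types[(result.index + offset) % 2]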