
I have a list of different events with mixed occurrences. For instance, event1 might occur three times, then another event, and later on event1 occurs again.

What I need are the intervals for each event and the number of occurrences of that event within each interval.

values = {
        '2017-11-28 11:00': 'event1',
        '2017-11-28 11:01': 'event1',
        '2017-11-28 11:02': 'event1',
        '2017-11-28 11:03': 'event2',
        '2017-11-28 11:04': 'event2',
        '2017-11-28 11:05': 'event1',
        '2017-11-28 11:06': 'event1',
        '2017-11-28 11:07': 'event1',
        '2017-11-28 11:08': 'event3',
        '2017-11-28 11:09': 'event3',
        '2017-11-28 11:10': 'event2',
        }

import pandas as pd
df = pd.DataFrame.from_dict(values, orient='index').reset_index()
df.columns = ['time', 'event']
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
df.sort_index(inplace=True)
df.head()

The expected result is:

occurrences = [
        {'start':'2017-11-28 11:00',
         'end':'2017-11-28 11:02',
         'event':'event1',
         'count':3},
        {'start':'2017-11-28 11:03',
         'end':'2017-11-28 11:04',
         'event':'event2',
         'count':2},
        {'start':'2017-11-28 11:05',
         'end':'2017-11-28 11:07',
         'event':'event1',
         'count':3},
        {'start':'2017-11-28 11:08',
         'end':'2017-11-28 11:09',
         'event':'event3',
         'count':2},
        {'start':'2017-11-28 11:10',
         'end':'2017-11-28 11:10',
         'event':'event2',
         'count':1},
        ]

I was thinking of using pd.merge_asof to find the start/end times of the intervals and then pd.cut (as explained here) for the groupby and count. But somehow I'm stuck. Any help is appreciated.

Matthias
  • Are you only looking for a pandas-based solution? You could iterate over it for a non-pandas-based solution. – harshil9968 Nov 28 '17 at 09:26
  • I bet that there is a nice Pandas solution. I found [this one](https://stackoverflow.com/questions/26539646/python-pandas-count-interval-of-occurrence) that is based on iteration. But Pandas would be preferable. – Matthias Nov 28 '17 at 09:30
  • You may also look here: https://stackoverflow.com/questions/25119524/pandas-conditional-rolling-count – Vivek Harikrishnan Nov 28 '17 at 09:44
  • @VivekHarikrishnan very nice idea. Good starting point. The problem remains how to group them by the intervals - meaning how do you know when the **highest cumcount** for an interval has been reached? PS: I found a solution. If you post yours, I will accept it. Otherwise I'll post mine. – Matthias Nov 28 '17 at 10:02
  • Based on @chrisb's answer in the question @VivekHarikrishnan mentioned, you could do `groupby` on `block`, then choose `min('time')` as `start` and `max('time')` as `end` – Yuan JI Nov 28 '17 at 10:10

2 Answers


Try the following approach:

In [68]: x = df.reset_index()

In [69]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
    ...:   .apply(lambda x:
    ...:             pd.DataFrame({
    ...:                 'start':[x['time'].min()],
    ...:                 'end':[x['time'].max()],
    ...:                 'event':[x['event'].iloc[0]],
    ...:                 'count':[len(x)]})
    ...:         )
    ...:   .reset_index(drop=True)
    ...:   .to_dict('records')
    ...: )
Out[69]:
[{'count': 3,
  'end': Timestamp('2017-11-28 11:02:00'),
  'event': 'event1',
  'start': Timestamp('2017-11-28 11:00:00')},
 {'count': 2,
  'end': Timestamp('2017-11-28 11:04:00'),
  'event': 'event2',
  'start': Timestamp('2017-11-28 11:03:00')},
 {'count': 3,
  'end': Timestamp('2017-11-28 11:07:00'),
  'event': 'event1',
  'start': Timestamp('2017-11-28 11:05:00')},
 {'count': 2,
  'end': Timestamp('2017-11-28 11:09:00'),
  'event': 'event3',
  'start': Timestamp('2017-11-28 11:08:00')},
 {'count': 1,
  'end': Timestamp('2017-11-28 11:10:00'),
  'event': 'event2',
  'start': Timestamp('2017-11-28 11:10:00')}]
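
The grouping key x.event.ne(x.event.shift()).cumsum() assigns one id to each run of consecutive identical events: ne(shift()) marks every row where the event changes, and the cumulative sum turns those change markers into consecutive group numbers. As a quick illustration (a sketch, not part of the original session), the key for the sorted example data looks like this:

# run ids for the sorted example data: one id per block of equal events
runs = x.event.ne(x.event.shift()).cumsum()
runs.tolist()  # [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5]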

or the following if you want the time columns as strings:

In [75]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
    ...:   .apply(lambda x:
    ...:             pd.DataFrame({
    ...:                 'start':[x['time'].min().strftime('%Y-%m-%d %H:%M:%S')],
    ...:                 'end':[x['time'].max().strftime('%Y-%m-%d %H:%M:%S')],
    ...:                 'event':[x['event'].iloc[0]],
    ...:                 'count':[len(x)]})
    ...:         )
    ...:   .reset_index(drop=True)
    ...:   .to_dict('records')
    ...: )
Out[75]:
[{'count': 3,
  'end': '2017-11-28 11:02:00',
  'event': 'event1',
  'start': '2017-11-28 11:00:00'},
 {'count': 2,
  'end': '2017-11-28 11:04:00',
  'event': 'event2',
  'start': '2017-11-28 11:03:00'},
 {'count': 3,
  'end': '2017-11-28 11:07:00',
  'event': 'event1',
  'start': '2017-11-28 11:05:00'},
 {'count': 2,
  'end': '2017-11-28 11:09:00',
  'event': 'event3',
  'start': '2017-11-28 11:08:00'},
 {'count': 1,
  'end': '2017-11-28 11:10:00',
  'event': 'event2',
  'start': '2017-11-28 11:10:00'}]
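
On newer pandas versions (0.25+), named aggregation gives a possibly simpler equivalent without constructing a per-group DataFrame. The following is a sketch under that assumption, not part of the original answer:

# one aggregation pass per run of consecutive events (assumes pandas >= 0.25)
runs = x['event'].ne(x['event'].shift()).cumsum().rename('run')
occurrences = (x.groupby(runs)
                 .agg(start=('time', 'min'),
                      end=('time', 'max'),
                      event=('event', 'first'),
                      count=('event', 'size'))
                 .to_dict('records'))

The start/end values stay as Timestamps here; apply strftime if strings are needed.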
MaxU - stand with Ukraine

Here are two solutions. The first one is based on the link provided by Vivek Harikrishnan and explained here. It creates consecutive numbers for the intervals and cumulatively counts the occurrences within each interval.

#%% first solution

# create intervals and count occurrences per interval
df['interval'] = (df['event'] != df['event'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby(['event', 'interval']).cumcount() + 1

# now group by the intervals, keeping the last row of each one
# (this holds the final count; the start/end times stay in the index and are dropped)
df.groupby('interval').last()
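
One way to also recover the start/end timestamps with this approach is sketched below; it is not part of the original solution and assumes df is still sorted by time:

# bring 'time' back as a column, then take the first/last row per interval
g = df.reset_index().groupby('interval')
summary = g.last().rename(columns={'time': 'end'})
summary['start'] = g['time'].first()
summary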

The second solution is based on the answer above given by MaxU. Like the first one, it creates interval numbers, but it also finds the start/end timestamps for those intervals.

#%% second solution

df = df.reset_index()
# create intervals
df = df.groupby(df['event'].ne(df['event'].shift()).cumsum())
# calc start/end times and count occurrences at the same time
df.apply(lambda x: pd.DataFrame({
                    'start':[x['time'].min()], 
                    'end':[x['time'].max()],
                    'event':[x['event'].iloc[0]],
                    'count':[len(x)]})).reset_index(drop=True)
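
To get the exact list-of-dicts shape from the question, the frame produced by the apply above can be stored and converted with to_dict('records'). The snippet below is a sketch: the variable names result and occurrences are illustrative additions, and the strftime format matches the minute-level strings used in the question:

# store the grouped result, format the timestamps, and convert to a list of dicts
result = df.apply(lambda x: pd.DataFrame({
                    'start':[x['time'].min()],
                    'end':[x['time'].max()],
                    'event':[x['event'].iloc[0]],
                    'count':[len(x)]})).reset_index(drop=True)
result['start'] = result['start'].dt.strftime('%Y-%m-%d %H:%M')
result['end'] = result['end'].dt.strftime('%Y-%m-%d %H:%M')
occurrences = result.to_dict('records')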
Matthias