Chunking DataFrame by gaps in datetime index

Question

First of all, my apologies if the title was too ambiguous.

I have a pd.DataFrame with datetime64 as a dtype of index. These indices, however, are not equally spaced: they mostly have a one-minute interval, but there are often some other intervals, such as two-minutes.

Suppose if I have a pd.DataFrame:

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
               '2018-11-28 14:01:00', '2018-11-28 14:02:00',
               '2018-11-28 14:03:00', '2018-11-28 14:05:00',
               '2018-11-28 14:06:00', '2018-11-28 14:07:00',
               '2018-11-28 14:08:00', '2018-11-28 14:09:00'], 
                   'count': np.random.randint(1, 100, 10)})
datetime_index = pd.to_datetime(df['date'])
df = df.set_index(datetime_index).drop('date', 1)
df.sort_index(inplace=True)

such that df is:

    count
date    
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51<<< two minutes gap
2018-11-28 14:05:00 41<<< unlike others
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93

My goal is to divide the df into multiple chunks where each chunk have a consistent frequency of one-minute. Thus, the expected result from above would become:

#df0
    count
date    
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51
#df1
    count
date   
2018-11-28 14:05:00 41
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93

I have tried Split a series on time gaps in pandas? which sadly was quite outdated and did not serve my purpose.

I did actually achieved what I want for the above sample, but the actual dataframe is much larger and has many more gaps, which make following solution extremely inefficient:

df['diff'] = pd.Series(df.index).diff().values
dif = pd.Series(df.index).diff()
gap_index = dif[dif == pd.to_timedelta(120000000000)].index[0]
df[:gap_index], df[gap_index:]

I would very much appreciate any insight on this issue

score 2 · Accepted Answer · answered Jan 28 '19 at 14:03

If you are interested to create a dictionary which will contain all your seperated dataframes, probably this should work:

df['identifier']=(~df.index.to_series().diff().dt.seconds.div(60, fill_value=0).lt(2)).cumsum()

                     count  identifier
date                                  
2018-11-28 13:59:00      7           0
2018-11-28 14:00:00     49           0
2018-11-28 14:01:00     13           0
2018-11-28 14:02:00     47           0
2018-11-28 14:03:00     72           0
2018-11-28 14:05:00     33           1
2018-11-28 14:06:00     50           1
2018-11-28 14:07:00     10           1
2018-11-28 14:08:00     86           1
2018-11-28 14:09:00     40           1

Post this create a dict and append the groups:

d = {}
for i,grp in df.groupby('identifier'):
    d.update(dict([('df_' + str(i),grp)]))
print(d)

Output:

{'df_0':                      count  identifier


date                                  
 2018-11-28 13:59:00      7           0
 2018-11-28 14:00:00     49           0
 2018-11-28 14:01:00     13           0
 2018-11-28 14:02:00     47           0
 2018-11-28 14:03:00     72           0,
 'df_1':                      count  identifier
 date                                  
 2018-11-28 14:05:00     33           1
 2018-11-28 14:06:00     50           1
 2018-11-28 14:07:00     10           1
 2018-11-28 14:08:00     86           1
 2018-11-28 14:09:00     40           1}

you can then see your data by calling the dict keys:

print(d['df_1'])
                     count  identifier
date                                  
2018-11-28 14:05:00     95           1
2018-11-28 14:06:00     68           1
2018-11-28 14:07:00     19           1
2018-11-28 14:08:00      9           1
2018-11-28 14:09:00     61           1

Thanks a lot :) Works nicely with large number of rows! – Chris Jan 29 '19 at 01:38 — Chris, Jan 29 '19 at 01:38

score 1 · Answer 2 · answered Jan 28 '19 at 12:46

Here's quick and dirty solution:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
           '2018-11-28 14:01:00', '2018-11-28 14:02:00',
           '2018-11-28 14:03:00', '2018-11-28 14:05:00',
           '2018-11-28 14:06:00', '2018-11-28 14:07:00',
           '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
               'count': np.random.randint(1, 100, 10)})

df['date'] = pd.to_datetime(df['date'])
df.sort_index(inplace=True)

# calculate where to cut
df['cut_point'] = pd.to_datetime(df.date.shift(-1))[0:len(df)-1]- df.date[0:len(df)-1] > '00:01:00'
df['cut_point'] = df['cut_point'].shift(1)

# generate chunks
res = []
chunk = []

for i,row in df.iterrows():
    date = row['date']
    count = row['count']
    cut_point = row['cut_point']

    if cut_point == True:

        res.append(pd.DataFrame(chunk))

        del chunk[:]

        chunk.append({'date':date,'count':count})

    else:
        chunk.append({'date':date,'count':count})

res.append(pd.DataFrame(chunk))

print(res[0])

print(res[1])

Thanks for the insight. But this takes very long time when dataframe is large. — Chris, Jan 29 '19 at 01:39

score 1 · Answer 3 · answered May 02 '22 at 19:57

For those of you googling this like me who may be working with a non-datetime column and you want to split the Dataframe based on irregular gaps in that column I've adapted anky 's answer above to a more general form:

Where:

'stepCol' is the name of the column you are looking for irregular gaps in
stepResolution is the regular step value you expect, or the smallest step size that can occur.
minStepsToSplitOn is the threshold where you would like to split. It should probably be '2' for most cases, but for example if you would like to allow gaps in 'stepCol' of up to 60 seconds, where your smallest step size is 10 seconds, then you would set stepResolution to '10' and minStepsToSplitOn to '7'.

stepResolution = 60 
minStepsToSplitOn = 2
df['identifier'] = (~subDf['stepCol'].diff().div(stepResolution, fill_value=0).lt(minStepsToSplitOn)).cumsum()

You can then simply use df.groupby('identifier') to split wherever there are irregular steps in your chosen column.

Chunking DataFrame by gaps in datetime index

3 Answers3