16

I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?

Example of Dataframe with No Columns but a Date Index:

In [29]: import pandas as pd

In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])

In [31]: ts = pd.DataFrame(index=dates)

As you can see there is a gap from 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc. Also, I can see that 2016-09-05 and 2016-09-06 is a two day range. How I can detect these and also print descriptive stats?

Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.

Noah Gift
  • 256
  • 1
  • 4
  • 9
  • Can you add some sample of data ? Please check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jezrael Oct 18 '16 at 21:05
  • Sure, will add right now. – Noah Gift Oct 18 '16 at 21:06

2 Answers2

15

Pandas version 1.0.1 has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object

from datetime import datetime, timedelta
import pandas as pd

# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])

# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]

# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]

# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.iteritems():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
Addison Klinke
  • 1,024
  • 2
  • 14
  • 23
  • If I used the column 'date' as the index in the df, how could the deltas be estimated? The function pd.index.difference() does sth different. – Ilias Machairas Apr 09 '20 at 13:29
6

here's something to get started:

df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df =  df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()

The cumsum() creates a grouping variable on 'ones' partitioning your data at the points your provided. If you print df to say a spreadsheet it will make sense:

print df.head()

                     ones
2016-08-03 00:00:00     0
2016-08-03 10:53:39     1
2016-08-04 00:00:00     1
2016-08-05 00:00:00     1
2016-08-06 00:00:00     1

print df.tail()
                     ones
2016-09-16 00:00:00     4
2016-09-17 00:00:00     4
2016-09-18 00:00:00     4
2016-09-19 00:00:00     4
2016-09-19 10:23:03     5

now to complete:

df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()

which gives:

              first_time  gaps
ones                          
0    2016-08-03 00:00:00     1
1    2016-08-03 10:53:39    34
2    2016-09-05 11:10:46     1
3    2016-09-05 11:11:30     2
4    2016-09-06 10:53:39    14
5    2016-09-19 10:23:03     1
Dickster
  • 2,969
  • 3
  • 23
  • 29
  • this looks like it is off to a good start. I got lost after cumsum though. At this point: `In [11]: df = df.astype(int) In [12]: df.head() Out[12]: ones 2016-08-03 00:00:00 0 2016-08-03 10:53:39 1 2016-08-04 00:00:00 0 2016-08-05 00:00:00 0 2016-08-06 00:00:00 0` – Noah Gift Oct 19 '16 at 13:24
  • The key step is to get a grouping variable, that is what the cumsum is for. – Dickster Oct 19 '16 at 13:29
  • Can you explain how I would take that grouping variable and detect first gap and first consecutive streak? – Noah Gift Oct 19 '16 at 13:30
  • Has the solution so far shown the gaps properly? We can solve the consecutive streaks later. – Dickster Oct 19 '16 at 21:36
  • almost can you label the gaps, "gaps" and then I will give you the answer. – Noah Gift Oct 19 '16 at 21:56
  • is the answer ok? – Dickster Oct 22 '16 at 17:43
  • I am new to answering questions on stackoverflow so apologies about how slow I am. The gaps are labeled, but I don't understand the time dimension. I would expect it too look more like this: 2016-08-03 10:53:39 0 2016-09-05 11:10:46 32 i.e. 32 days (off the top of my head), between 8/03 - 09/05 and then potentially another column like this: first_time streaks ones 0 2016-08-03 00:00:00 0 1 2016-08-03 10:53:39 1 As in each additional day adds to streak.. – Noah Gift Oct 24 '16 at 02:27
  • best to place your desired output into your question in a formatted style datatable. I think there are 2 effects you are trying to achieve, let's square away each in turn, first the gaps and then the specifics about 2 day ranges. – Dickster Oct 27 '16 at 22:35