
I'm migrating some code from Python list primitives to a pandas implementation. For some time series, I want to find all the discontinuous segments along with their durations. Is there a clean way of doing this in pandas?

My dataframe looks like this:

In [23]: df
Out[23]:
2016-07-01 05:35:00    60.466667
2016-07-01 05:40:00          NaN
2016-07-01 05:45:00          NaN
2016-07-01 05:50:00          NaN
2016-07-01 05:55:00          NaN
2016-07-01 06:00:00          NaN
2016-07-01 06:05:00          NaN
2016-07-01 06:10:00          NaN
2016-07-01 06:15:00          NaN
2016-07-01 06:20:00          NaN
2016-07-01 06:25:00          NaN
2016-07-01 06:30:00          NaN
2016-07-01 06:35:00          NaN
2016-07-01 06:40:00          NaN
2016-07-01 06:45:00          NaN
2016-07-01 06:50:00          NaN
2016-07-01 06:55:00          NaN
2016-07-01 07:00:00          NaN
2016-07-01 07:05:00          NaN
2016-07-01 07:10:00          NaN
2016-07-01 07:15:00          NaN
2016-07-01 07:20:00          NaN
2016-07-01 07:25:00          NaN
2016-07-01 07:30:00          NaN
2016-07-01 07:35:00          NaN
2016-07-01 07:40:00          NaN
2016-07-01 07:45:00    63.500000
2016-07-01 07:50:00    67.293333
2016-07-01 07:55:00    67.633333
2016-07-01 08:00:00    68.306667
                         ...
2016-07-01 11:20:00          NaN
2016-07-01 11:25:00          NaN
2016-07-01 11:30:00    62.000000
2016-07-01 11:35:00    69.513333
2016-07-01 11:40:00    64.931298
2016-07-01 11:45:00    51.980000
2016-07-01 11:50:00    55.253333
2016-07-01 11:55:00    51.273333
2016-07-01 12:00:00    52.080000
2016-07-01 12:05:00    54.580000
2016-07-01 12:10:00    55.306667
2016-07-01 12:15:00    55.200000
2016-07-01 12:20:00    57.140000
2016-07-01 12:25:00    57.020000
2016-07-01 12:30:00    57.526667
2016-07-01 12:35:00    57.880000
2016-07-01 12:40:00    67.286667
2016-07-01 12:45:00    58.153333
2016-07-01 12:50:00    57.460000
2016-07-01 12:55:00    54.413333
2016-07-01 13:00:00    55.526667
2016-07-01 13:05:00    56.120000
2016-07-01 13:10:00    55.620000
2016-07-01 13:15:00    56.420000
2016-07-01 13:20:00    51.893333
2016-07-01 13:25:00    74.451613
2016-07-01 13:30:00    54.898551
2016-07-01 13:35:00          NaN
2016-07-01 13:40:00    63.355140
2016-07-01 13:45:00    61.000000
Freq: 5T, dtype: float64

Where, for example, the first discontinuous event is from 5:40 to 7:40.

leonsas

2 Answers


This should work as long as you have a series or a single-column dataframe.

>>> pd.Series(df.dropna().index).diff()

Which can be improved to get a useful output with:

MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
discontinuities = pd.Series(df.dropna().index).diff()
discontinuities = discontinuities.sort_values(ascending=False)
discontinuities[discontinuities > MIN_GAP_TIMEDELTA].size
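To get each gap's end point and duration rather than just a count, here's a minimal sketch of the same dropna/diff idea on a toy series (the values and the 5-minute step are assumptions for illustration, not the asker's actual data):

```python
import numpy as np
import pandas as pd

# Toy series on a 5-minute grid with two NaN gaps (hypothetical data).
idx = pd.date_range("2016-07-01 05:35", periods=8, freq="5min")
s = pd.Series([60.0, np.nan, np.nan, np.nan, 63.5, 67.3, np.nan, 61.0],
              index=idx)

# Diff the timestamps of the valid samples; any diff larger than the
# sampling step spans a gap.
valid = pd.Series(s.dropna().index)
deltas = valid.diff()

step = pd.Timedelta(minutes=5)
is_gap = deltas > step
gap_ends = valid[is_gap]               # first valid timestamp after each gap
gap_durations = deltas[is_gap] - step  # total missing time per gap
```

Subtracting one `step` from each diff gives the span of missing samples only, so a single missing 5-minute sample reports a 5-minute gap.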
ilmarinen

Not as elegant or short as the pandas-based solution(s), but with performance in mind one can use NumPy arrays and functions. So, assuming the date-times have a regular frequency, here's a NumPy-based approach to get discontinuity lengths, the maximum length, and thresholded counts -

# Get start and stop indices of the discontinuities signified by runs of NaNs
idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]

# Do differentiation on those indices which would give us the length of 
# intervals of discontinuities. These could be used in various ways.
discontinuity_lens = np.diff(idx.reshape(-1,2),axis=1)

# Max discontinuity length
discontinuity_maxlen = discontinuity_lens.max()

# Count of discontinuities that are greater than a threshold of 30 mins as
# listed with threshold parameter : MIN_GAP_TIMEDELTA = Timedelta(minutes=30)
# (in terms of steps that would be 6 because freq of input dataframe is 5 mins)
thresholded_count = (discontinuity_lens>=6).sum()

Please note this is largely based on another NumPy solution to: Longest run/island of a number in Python.
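To report durations rather than step counts, the run lengths can be converted using the sampling interval. Here's a minimal sketch of the same edge-detection trick on a toy 5-minute series (the data values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy series on a 5-minute grid with two NaN gaps (hypothetical data).
idx = pd.date_range("2016-07-01 05:35", periods=8, freq="5min")
s = pd.Series([60.0, np.nan, np.nan, np.nan, 63.5, 67.3, np.nan, 61.0],
              index=idx)

# Pad with False so every NaN run has both a rising and a falling edge.
mask = np.hstack(([False], np.isnan(s.values), [False]))
edges = np.where(np.diff(mask))[0]                 # start/stop index pairs
run_lens = np.diff(edges.reshape(-1, 2), axis=1).ravel()

# Convert step counts to durations via the 5-minute sampling interval.
durations = pd.to_timedelta(run_lens * 5, unit="min")
```

`np.diff` on a boolean array returns element-wise inequality, so `edges` holds the positions where the NaN mask flips on or off.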

Runtime test

I timed @ilmarinen's pandas-based solution and the NumPy-based approach posted earlier in this answer on a large dataframe filled with random elements and 50% randomly placed NaNs.

Function definitions :

def thresholdedcount_pandas(df):
    MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
    discontinuities = df.dropna().reset_index()['index'].diff()
    return (discontinuities > MIN_GAP_TIMEDELTA).sum()

def thresholdedcount_numpy(df):
    idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
    nan_interval_lens = np.diff(idx.reshape(-1,2),axis=1)
    return (nan_interval_lens>=6).sum()

Timings :

In [325]: # Random dataframe with 5 min interval data and filled with 50% NaNs
     ...: rng = pd.date_range('1/1/2011', periods=10000, freq='5Min')
     ...: df = pd.DataFrame(np.random.randn(len(rng)), index=rng)
     ...: df[0][np.random.randint(0,df.shape[0],(int(df.shape[0]/2)))] = np.nan
     ...: 

In [326]: np.allclose(thresholdedcount_pandas(df),thresholdedcount_numpy(df))
Out[326]: True

In [327]: %timeit thresholdedcount_pandas(df)
100 loops, best of 3: 3 ms per loop

In [328]: %timeit thresholdedcount_numpy(df)
1000 loops, best of 3: 318 µs per loop
Divakar
  • Will look into this solution. Right now I'd rather go with a simpler solution, as performance is not a huge issue since these run mostly in background jobs. – leonsas Jul 03 '16 at 16:36