4

I'm trying to write a function that takes a continuous time series and returns a data structure which describes any missing gaps in the data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.

It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations -- hasn't there? Thanks for any help, folks.

import pandas as pd


def get_gaps(series):
    """
    @param series: a continuous time series of data with the index's freq set
    @return: a series where the index is the start of gaps, and the values are
         the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()

    # any row not missing while the last was is a gap end        
    gap_ends = series[~missing & different_from_last].index

    # count the start as different from the last
    different_from_last[0] = True

    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index        

    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
         gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)

    return pd.Series(index=gap_starts, data=gap_ends)

For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7

jwilner
  • 6,348
  • 6
  • 35
  • 47

1 Answers1

2

This problem can be transformed to find the continuous numbers in a list. find all the indices where the series is null, and if a run of (3,4,5,6) are all null, you only need to extract the start and end (3,6)

import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby


# create an example 
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.series( data, index=data)
s = s.reindex(xrange(18))
print find_gap(s)  


def find_gap(s): 
    """ just treat it as a list
    """ 
    nullindex = np.where( s.isnull())[0]
    ranges = []
    for k, g in groupby(enumerate(nullindex), lambda (i,x):i-x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    startgap, endgap = zip(* ranges) 
    return pd.series( endgap, index= startgap )

reference : Identify groups of continuous numbers in a list

Community
  • 1
  • 1
fast tooth
  • 2,317
  • 4
  • 25
  • 34
  • Thanks -- you're definitely right that there are great itertools, etc. solutions for this. I should've been clearer though that I'm really hoping to keep this on the numpy level to afford as much optimization as possible. Also, I'm not sure this is really a cleaner solution than the above. – jwilner Jul 18 '14 at 13:20
  • 1
    maybe you want to check out this solution :http://stackoverflow.com/questions/7352684/how-to-find-the-groups-of-consecutive-elements-from-an-array-in-numpy the one by unutbu – fast tooth Jul 18 '14 at 13:31