10

I have a Pandas Dataframe of indices and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.077950

I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:

 [(150, 185), (632, 680), (1500,1870)]

Where the first tuple is of a region that starts at index 150, has 35 values that are all above 0.5 in row, and ends on index 185 non-inclusive.

I started by filtering for only values above 0.5 like so

 df = df[df['values'] >= 0.5]

And now I have values like this:

632  0.545700
633  0.574983
634  0.572083
635  0.595500
636  0.632033
637  0.657617
638  0.643300
639  0.646283

I can't show my actual dataset, but the following one should be a good representation

import numpy as np
from pandas import *

np.random.seed(seed=901212)

df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35

yielding:

 1  0.491233
 2  0.538596
 3  0.516740
 4  0.381134
 5  0.670157
 6  0.846366
 7  0.495554
 8  0.436044
 9  0.695597
10  0.826591
...

Where the region (2,4) has two values above 0.5. However this would be too short. On the other hand, the region (25,44) with 19 values above 0.5 in a row would be added to list.

tlnagy
  • 3,274
  • 4
  • 24
  • 37

2 Answers2

24

You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:

# tag rows based on the threshold
df['tag'] = df['values'] > .5

# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]

# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]

# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]

so for example the first region would be:

>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
1

I think this prints what you want. It is based heavily on Joe Kington's answer here I guess it is appropriate to up-vote that.

import numpy as np

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indicies of changes in "condition"
    d = np.diff(condition,n=1, axis=0)
    idx, _ = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right. -JK
    # LB this copy to increment is horrible but I get 
    # ValueError: output array is read-only without it 

    mutable_idx = np.array(idx)
    mutable_idx +=  1
    idx = mutable_idx

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5

    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35

    df = pd.DataFrame(data=data,columns=['values'])

    match_bools =  df.values > VALUE_THRESHOLD 


    print('with boolian array')
    for start, stop in contiguous_regions(match_bools):
        if (stop - start > RUN_LENGTH_THRESHOLD):
            print (start, stop)



if __name__ == '__main__':
    main()

I would be surprised if there were not more elegant ways

Community
  • 1
  • 1
  • Thanks, but I prefer a predominantly pandas-based version if possible. A NumPy version may be preferable for performance reasons though. – tlnagy Jun 18 '14 at 10:54
  • Pandas is written with performance in mind. I would no be surprised if the solution from @behzad.nourisolution was as fast. [IPython's %timeit magic](http://pynash.org/2013/03/06/timing-and-profiling.html) can be useful for quick profiling – Laurence Billingham Jun 18 '14 at 15:18