1

I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start and stop column.

Consider the series, valid, which is the column of a DataFrame called trials. Here is what trials Looks like:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start    156  non-null values
stop     156  non-null values
valid    156  non-null values
dtypes: bool(1), float64(2)

I have a separate DataFrame called 'blink`. It has three columns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart    41  non-null values
tstop     41  non-null values
dur       41  non-null values
dtypes: bool(1), float64(2)

The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop and tstart.

I would like to set each row of trials['valid'] to False if the interval between it's corresponding trials['start'] to trials['stop'] overlaps with any of the blink['tstart'] to blink['tstop'] intervals.

I could iterate through the rows and use np.arange along with the in operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?

If anything is unclear, I'll of course be happy to provide additional details.

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
Louis Thibault
  • 20,240
  • 25
  • 83
  • 152

2 Answers2

1

Your blink data

In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))

In [28]: blink_s = blink.stack()

In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]

Construct a series of of the blink (kind of like pivoting), but we need new names

In [37]: blink_s
Out[37]: 
tstart_0     0
tstop_0      5
tstart_1    10
tstop_1     15

The trial data

In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))

Tile the blink_s across rows of the trial

In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])

Join em up

In [33]: x = trial.join(blink_df)

In [34]: x
Out[34]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1
0      3     4         0        5        10       15
1      7    10         0        5        10       15
2     12    16         0        5        10       15

Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)

In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))

In [36]: x
Out[36]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1  valid
0      3     4         0        5        10       15   True
1      7    10         0        5        10       15  False
2     12    16         0        5        10       15  False

This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin.

In essence you are flattening the blink frame to a series that you then can apply to each of the trial

Using Isin (and OP data):

Convert to int64 data

trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')

Add in a row that we know is ni the range

trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)

Construct the range of values which we want to test

selections = []
for r in blink.iterrows():
    e = r[1]
    selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)

Return true if the passed start/stop are in the selection range

def f(s):
    return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)

trial[trial.valid]

I inserted 1 row that I knew would pass, no other rows pass

     start  stop valid
156   1000  1200  True
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • this looks great! I just have one question though: where are the values in your very first line coming from? What's the [0, 10] and [5, 15] ? Are those timepoints? If so, doesn't that mean I can just replace them with corresponding the columns from my `trials` dataframe? – Louis Thibault Apr 07 '13 at 22:10
  • I just made them up! (this was before your example), ``blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))`` I wasn't sure what your 'timepoints' looked like, and in particular if they are integer or not (my fist part of solution assumes they are float's, if you use isin they must be integers) – Jeff Apr 07 '13 at 23:08
  • answer to your send question is, yes, this is exactly what you should do – Jeff Apr 07 '13 at 23:09
  • thank you! I'm getting a result which I'm 99% sure is erroneous. Namely, the x['valid'] column is giving me entirely false values. casting everything as an int is definitely an acceptable solution (rounding is preferable to truncating), but I'm having trouble seeing how I would go about doing that. Would you mind showing me? Also, [here](http://paste.ubuntu.com/5688105/) is my attempt at following your example. Did I do something silly? – Louis Thibault Apr 08 '13 at 01:17
  • added an example using isin (and your data), none of your data passes; but I added a trial that does for testing – Jeff Apr 08 '13 at 12:02
0

Assume the size of trials is m and the size of blink is n.

First, sort the blink by tstart before you check each row in trials, merge the overlapped ones, it takes O(n log n), take a look at this

While you check whether a start/stop pair is valid or not, follow the following algorithm.

  1. Use the binary search to insert the start in the sorted tstarts
  2. Use the binary search to insert the stop in the sorted tstarts
  3. If there's any tstart between start and stop, return True
  4. Find the the one just before the start, find its tstop, if they overlap with the start/stop pair, return True
  5. Return False

Above algorithm might help you reduce the time complexity to check whether a start/stop pair overlaps with any pair in blink or not from O(n) to O(log n) where n is the length of blink

The time complexity will reduce from O(mn) to O(m log n) + O(n log n). If m >> log n and n is large, that might help you a lot.

Community
  • 1
  • 1
waitingkuo
  • 89,478
  • 28
  • 112
  • 118