Check if value in pandas dataframe is within any two values of two other columns in another dataframe

Question

I have two dataframes of different length. dfSamples (63012375 rows) and dfFixations (200000 rows).

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

I would like to check each value in dfSamples if it is within any two ranges given in dfFixations and then assign a label to this value. I have found this: Check if value in a dataframe is between two values in another dataframe, but the loop solution is terribly slow and I cannot make any other solution work.

Working (but very slow) example:

labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
for i, fixation in dfFix.iterrows():
    log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
    labels[log_range] = 'fixation'
labels[labels != 'fixation'] = 'no_fixation'
dfSamples['labels'] = labels

Following this example: Performance of Pandas apply vs np.vectorize to create new column from existing columns I have tried to vectorize this but with no success.

def check_range(samples, tstart, tend):
    log_range = (samples > tstart) & (samples < tend)
    return log_range
fixations = list(map(check_range, dfSamples['tSample'], dfFix['tStart'], dfFix['tEnd']))

Would appreciate any help!

jezrael · Accepted Answer · 2021-11-04T12:41:46.790

Use IntervalIndex.from_arrays with IntervalIndex.get_indexer , if not match is returned -1, so checked and set ouput in numpy.where:

i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],
                                 dfFixations['tEnd'], 
                                 closed="both")
pos = i.get_indexer(dfSamples['tSample'])
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")

print (dfSamples)
   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation

Performance: In ideal nice sorted not overlap data, in real should be performance different, the best test it.

dfSamples = pd.DataFrame({'tSample':range(10000)})  
dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})
    


In [165]: %%timeit
     ...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
     ...: for i, fixation in dfFixations.iterrows():
     ...:     log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
     ...:     labels[log_range] = 'fixation'
     ...: labels[labels != 'fixation'] = 'no_fixation'
     ...: dfSamples['labels'] = labels
     ...: 
     ...: 
1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [168]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: dfSamples["labels1"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")
     ...: 
315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [170]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
     ...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation")
     ...: 
82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [166]: %%timeit
     ...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: pos = s.get_indexer(dfSamples['tSample'])
     ...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
     ...: 
27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Great answer! For my data set, this is 10x faster than my solution. Thank you! — heldm, Nov 09 '21 at 10:24
Hi, I am getting the error message "InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique" even though my intervals are not overlapping. If I use `[ii.overlaps(x) for x in ii]` only the diagonal is True — heldm, Nov 09 '21 at 13:17
@heldm - hmmm, maybe there is at least one overlapped value. Another solution also not working? — jezrael, Nov 09 '21 at 13:22

Riley · Answer 2 · 2022-06-21T03:55:25.640

setup

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

solution

Create an interval index from your start and end points

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")

ii.contains is a method which checks if a point is contained by each interval in the interval index, eg

dfSamples["tSample"].apply(ii.contains)

gives

0     [True, False]
1     [True, False]
2     [True, False]
3    [False, False]
4     [False, True]
5     [False, True]
Name: tSample, dtype: object

We're going to take this result, apply any function to each element (a list) to get a pandas.Series of booleans, which we can then use with numpy.where

dfSamples["labels"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")

the result

   tSample       labels
0        4     fixation
1        6     fixation
2        8  no_fixation
3       10  no_fixation
4       12     fixation
5       14  no_fixation

edit: faster version

Using piso v0.6.0

import piso
import numpy as np

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
dfSamples["labels"] = np.where(contained, "fixation", "no_fixation")

This will run in a similar time to @jezrael's solution, however it can handle cases where intervals overlaps eg

dfFixations = pd.DataFrame({'tStart':[4,5,12],'tEnd':[8,9,14]})

note: I am the creator of piso. Please feel free to reach out with feedback or questions if you have any.

Thanks for your answer! Unfortunately, this seems to be even slower than interrows for me. Took about 5x longer than my solution above — heldm, Nov 03 '21 at 14:30
I may have a faster solution for you but I need to know how your intervals are "closed". In your first solution your intervals are closed on both ends, and in your second they're closed on neither. — Riley, Nov 03 '21 at 20:37
Would be great! Whatever is fastest will be best in this case. Preferably, only 10 would get the label 'no_fixation'. — heldm, Nov 04 '21 at 08:40

Check if value in pandas dataframe is within any two values of two other columns in another dataframe

2 Answers2