
I have very long time series for which I need to set values within intervals around certain events to np.nan. measures is a DataFrame with a DatetimeIndex and events is a separate DatetimeIndex.

Measures looks like:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |

For example, if there were only a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values was +/- 5 seconds, the output would be:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan   |
| 1970-01-01 00:06:24 | np.nan   |
| 1970-01-01 00:06:25 | np.nan   |
| 1970-01-01 00:08:12 | 0.837405 |

Currently I'm iterating over the events using .loc:

for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")] = np.nan

This works but takes too long because both dataframes are large (events: 10k rows, measures: 1.5M rows). For the same reason I couldn't construct a boolean index like so:

measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following (10k x 1.5M) comparison arrays wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
mask = np.any(left_bool_array & right_bool_array, axis=0)
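
Since measures.index is sorted, each event window covers a contiguous run of row positions, so a memory-friendly sketch along the following lines (reusing the arrays defined above with np.searchsorted; untested on the real data) should avoid the huge intermediate arrays:

starts = measure_index.searchsorted(left_bounds)                # first row >= left bound
stops = measure_index.searchsorted(right_bounds, side="right")  # first row > right bound

# Difference array: +1 where a window opens, -1 just past where it closes;
# a running total > 0 marks rows covered by at least one window.
delta = np.zeros(len(measure_index) + 1, dtype=np.int64)
np.add.at(delta, starts, 1)
np.add.at(delta, stops, -1)
measures[np.cumsum(delta[:-1]) > 0] = np.nan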

Left-joining the events onto measures or reindexing events is also out of the question, as both take too long.

I then ran into pd.IntervalIndex:

left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index = pd.IntervalIndex.from_arrays(left_bound, right_bound)

IntervalIndex has a .contains() method which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". However, for my use case I'd need to loop through the measures frame and sum the boolean array for each row. I'm looking for a method like so:

pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as input)

With each element in the output representing whether the corresponding input value is in any of the intervals.
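
The closest existing thing I can find is get_indexer_non_unique, which (unlike get_indexer) accepts overlapping intervals; a sketch of how it could be used here, not timed on the full data:

# indexer lists matching interval positions; missing lists the positions
# in measures.index that fall inside no interval at all.
indexer, missing = interval_index.get_indexer_non_unique(measures.index)

in_any_interval = np.ones(len(measures), dtype=bool)
in_any_interval[missing] = False
measures[in_any_interval] = np.nan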


Edit: performance of the options discussed in the answers below:

len(events) = 10000, len(measures) = 1525229

  • pandas .loc: 10.5 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan

  • Staircase: 13.9 seconds

for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end=measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index - pd.Timedelta('4 min'), end=dilution_copy.index + pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)

  • Bisect + .iloc: 35.1 seconds

for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan
kubatucka
  • Can you provide a simplified (shortened) version of `measures`, `events` and the expected final output? – Mortz Oct 13 '21 at 07:40
  • @Mortz Just did, but the question remains how to do it efficiently for large dataframes. – kubatucka Oct 13 '21 at 08:12
  • 1) work on timestamps instead of datetimes. 2) work with numpy arrays instead of dataframes. 3) iterate over events and measures simultaneously (which you mentioned in a comment below) 4) use numba or another compiler to make this iteration faster (as it will be the bottleneck for sure) – dankal444 Oct 13 '21 at 20:51
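
A sketch of that recipe (int64 timestamps, numpy arrays, one simultaneous pass over both sorted indexes, compiled with numba). in_any_window is a made-up name, numba is an extra dependency, and the two-pointer logic relies on events being sorted, so that the fixed-width windows are sorted by both start and end:

import numpy as np
from numba import njit

@njit
def in_any_window(times, starts, ends):
    # Two-pointer sweep over sorted measurement times and sorted,
    # equal-width event windows: O(len(times) + len(starts)).
    out = np.zeros(times.shape[0], dtype=np.bool_)
    j = 0
    for i in range(times.shape[0]):
        # discard windows that end before the current measurement
        while j < ends.shape[0] and ends[j] < times[i]:
            j += 1
        if j < starts.shape[0] and starts[j] <= times[i]:
            out[i] = True
    return out

times = measures.index.to_numpy().view(np.int64)
starts = (events - pd.Timedelta("4min")).to_numpy().view(np.int64)
ends = (events + pd.Timedelta("1min")).to_numpy().view(np.int64)
measures[in_any_window(times, starts, ends)] = np.nan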

3 Answers


If your measures data is already sorted (or if it is not too time-consuming to sort once), you can think of using bisect.

Here is a more complete version of this approximate solution:

  • Check every element of events for where it can be "inserted" in the measures
  • Check if the timestamps on either side of this "insertion point" are within 5 seconds
  • If yes, set to nan
from bisect import bisect

def bisect_loop():
    for event in events:
        insertion_point = bisect(measures.index, event)
        # Scan left from the insertion point until a timestamp falls
        # outside the 5-second window (or we run off the start).
        i = insertion_point - 1
        while i >= 0 and (event - measures.index[i]).total_seconds() < 5:
            measures.iloc[i] = np.nan
            i -= 1
        # Scan right from the insertion point until a timestamp falls
        # outside the 5-second window (or we run off the end).
        j = insertion_point
        while j < len(measures.index) and (measures.index[j] - event).total_seconds() < 5:
            measures.iloc[j] = np.nan
            j += 1

Here are some stats on a dummy dataset with 150 measures and 10 events -

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})

timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)

%timeit op_loop() # This is your loc based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Mortz
  • This would work if we removed only 1 point on each side. But I'm trying to remove all points within that interval. Sorry my question wasn't clear enough, I'll update the example. – kubatucka Oct 13 '21 at 12:10
  • Yes, I understand. That's why I said it is an approximate solution. You could scan the adjacent elements to the `bisection_point` with a `while` loop and stop once you find the points which are more than 5s off – Mortz Oct 13 '21 at 12:27
  • Don't you think running a while loop on each side would be slower than .loc[lower_bound:higher_bound] ? The approximate frequency of measures is 5 seconds. That's 60 iterations per event. But I'll try it. – kubatucka Oct 13 '21 at 12:31
  • My sense is that if an `event` has to change a lot of `measures` to a `nan`, then yes - I would expect it to be slow. But as you say, if the approximate frequency of measures is ~5s, then each event will have to only look at a couple of `while` iterations. – Mortz Oct 13 '21 at 12:36
  • Better would be to find the index of the upper and lower bound with bisect and set the values in that index range to np.nan, I think. – kubatucka Oct 13 '21 at 13:43
  • Yet this is still O(interval * logN * len(events)). As both indexes are sorted, it feels like it should be possible to do it in O(len(measures)), maybe by iterating through both indexes at the same time. – kubatucka Oct 13 '21 at 13:54

Can be solved with staircase, a package built on pandas and numpy for working with (mathematical) step functions.

Setup:

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index = pd.DatetimeIndex([
        pd.Timestamp("1970-1-1 00:00:15"),
        pd.Timestamp("1970-1-1 00:02:37"),
        pd.Timestamp("1970-1-1 00:03:59"),
        pd.Timestamp("1970-1-1 00:06:17"),
        pd.Timestamp("1970-1-1 00:06:24"),
        pd.Timestamp("1970-1-1 00:06:25"),
        pd.Timestamp("1970-1-1 00:08:12"),
    ])
)

events = pd.DatetimeIndex(['1970-01-01 00:06:21'])

Solution:

import pandas as pd
import staircase as sc

sf = sc.Stairs(start=measures.index, end=measures.index[1:], value=measures.values)
mask = sc.Stairs(start=events-pd.Timedelta('5 seconds'), end=events+pd.Timedelta('5 seconds'))
masked = sf.mask(mask)
result = masked.sample(measures.index, include_index=True)

Why it works

1st line: create a step function which is made up of intervals whose endpoints are the index of measures. The last interval, which starts at 1970-01-01 00:08:12, has no end point and will be of infinite length

2nd line: create a step function where the times in your events variable are the centers of intervals in the step function, and the endpoints are +/- 5 seconds from the center. No problem at all if any intervals overlap.

3rd line: mask the first step function with the second step function, which sets values in the first step function to NaN whenever the second step function is non-zero

4th line: evaluates the masked step function back at the original measurement times (measures.index), yielding the series with NaNs inserted

note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Riley
  • 2,153
  • 1
  • 6
  • 16
  • This could work, but having glanced at the source code I highly doubt it would be faster. It has multiple concatenations and iterates through both series more than once. – kubatucka Oct 19 '21 at 08:34
  • It'd be interesting to know what the timings are if you can test it on the full example. The internals are mostly numpy and pandas vectorised functions - no "for" loops. – Riley Oct 19 '21 at 09:39

One option is conditional_join from pyjanitor.

Reusing @Riley's data:

# pip install pyjanitor
import pandas as pd
import numpy as np
import janitor

df1 = measures.rename('measure').reset_index()
df2 = pd.DataFrame(events)
df2 = (df2.assign(start = df2[0] - pd.Timedelta('5 seconds'),
                  end = df2[0] + pd.Timedelta('5 seconds'))
          .drop(columns=0)
       )
(
df1
.conditional_join(
    df2, 
    ('index', 'start', '>='), 
    ('index', 'end', '<='), 
    how = 'left')
.assign(measure = lambda df: np.where(df.start.isna(), 
                                      df.measure, 
                                      np.nan))
.loc[:, ['index', 'measure']]
)

                index   measure
0 1970-01-01 00:00:15  0.471331
1 1970-01-01 00:02:37  0.069177
2 1970-01-01 00:03:59  0.955357
3 1970-01-01 00:06:17       NaN
4 1970-01-01 00:06:24       NaN
5 1970-01-01 00:06:25       NaN
6 1970-01-01 00:08:12  0.837405
sammywemmy