I have very long time series for which I need to set values within intervals around certain events to `np.nan`. `measures` is a DataFrame with a DatetimeIndex, and `events` is a separate, distinct DatetimeIndex.

`measures` looks like:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |
For example, if there were only a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values were +/- 5 seconds, the output would be:
| index | measure |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan |
| 1970-01-01 00:06:24 | np.nan |
| 1970-01-01 00:06:25 | np.nan |
| 1970-01-01 00:08:12 | 0.837405 |
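Assuming the index is sorted (as it appears to be), that output can be reproduced with a single label slice, since `.loc` slicing on a monotonic DatetimeIndex is inclusive on both ends; a minimal sketch:

```python
import numpy as np
import pandas as pd

# The example frame from the tables above.
measures = pd.DataFrame(
    {"measure": [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405]},
    index=pd.to_datetime([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)

event = pd.Timestamp("1970-01-01 00:06:21")
half_window = pd.Timedelta("5s")

# Label slicing on a monotonic DatetimeIndex includes both endpoints,
# so this blanks exactly the rows within +/- 5 seconds of the event.
measures.loc[event - half_window : event + half_window, "measure"] = np.nan
```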
Currently I'm iterating over the events using `.loc`:

```python
for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")] = np.nan
```
Now this works, but it takes too long: both frames are large (`events`: 10k rows, `measures`: 1.5M rows). Because of those sizes I also couldn't construct a boolean index like so:

```python
measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
mask = np.any(left_bool_array & right_bool_array, axis=0)
```
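If the all-at-once product is the only obstacle, one workaround (a sketch; the chunk size is an arbitrary assumption to tune to available RAM) is to broadcast over small batches of events and fold each batch into a running mask, so only a slice of the product is ever materialised:

```python
import numpy as np
import pandas as pd

# Toy stand-ins shaped like the real problem (1.5M x 10k won't fit; 1000 x 20 will).
measure_index = pd.date_range("1970-01-01", periods=1000, freq="10s").to_numpy()
events = pd.date_range("1970-01-01 00:30:00", periods=20, freq="7min")
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()

mask = np.zeros(measure_index.shape[0], dtype=bool)
chunk = 8  # arbitrary; choose so chunk * len(measure_index) booleans fits in RAM
for i in range(0, left_bounds.shape[0], chunk):
    lb = left_bounds[i:i + chunk].reshape(-1, 1)
    rb = right_bounds[i:i + chunk].reshape(-1, 1)
    # reduce this batch of windows immediately so only a slice is materialised
    mask |= ((measure_index >= lb) & (measure_index <= rb)).any(axis=0)
```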
Left joining the events on measures or reindexing events is also out of the question as they take too long.
I then ran into `pd.IntervalIndex`:

```python
left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index = pd.IntervalIndex.from_arrays(left_bound, right_bound)
```

`IntervalIndex` has a `.contains()` method which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". However, for my use case I'd need to loop through the `measures` frame and sum the boolean array for each row. I'm looking for a method like:
```python
pandas.IntervalIndex.intersect(input: array_like) -> boolean_array  # same shape as input
```

with each element in the output representing whether the corresponding input value is in any of the intervals.
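Absent such a method, the same boolean array can be computed with `np.searchsorted`: because the events are sorted and every window has the same width, both bound arrays are sorted, and a timestamp lies in some window exactly when more windows have opened at or before it than have closed strictly before it. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for measures.index and events.
measure_index = pd.date_range("1970-01-01", periods=500, freq="13s")
events = pd.date_range("1970-01-01 00:10:00", periods=8, freq="11min")
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
ts = measure_index.to_numpy()

# For each timestamp: how many windows have opened at or before it,
# and how many have closed strictly before it.  The timestamp sits in
# some window exactly when opened > closed (valid because both bound
# arrays are sorted and index-aligned).
opened = np.searchsorted(left_bounds, ts, side="right")
closed = np.searchsorted(right_bounds, ts, side="left")
mask = opened > closed  # boolean array, same shape as the input

# measures.loc[mask, "measure"] = np.nan  # then apply like this
```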
Similar but different questions:

- "Fastest way to merge pandas dataframe on ranges": interval lookup with an interval index
- "Match IntervalIndex as part of a MultiIndex": quite similar, but the suggested solutions (merges) are not applicable
- "Best way to join / merge by range in pandas": if only I had the same indexes and a single interval per row to look up
Edit: performance of the options discussed in the answer below, with len(events) = 10000 and len(measures) = 1525229:

- pandas `.loc`: 10.5 seconds

```python
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan
```
- Staircase: 13.9 seconds

```python
for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end=measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index - pd.Timedelta('4 min'), end=dilution_copy.index + pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)
```
- Bisect + `.iloc`: 35.1 seconds

```python
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")
    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan
```
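One further tweak, not benchmarked above (so treat the speedup as an assumption): with 10k events the +/- windows can overlap, and coalescing them first shrinks the number of slices any of the loop-based approaches must take. `merge_windows` below is a hypothetical helper:

```python
import pandas as pd

def merge_windows(left_bounds, right_bounds):
    """Coalesce overlapping [left, right] windows.

    Assumes both inputs are sorted ascending and index-aligned,
    which holds here because the windows are events +/- fixed offsets.
    """
    merged = []
    cur_l, cur_r = left_bounds[0], right_bounds[0]
    for l, r in zip(left_bounds[1:], right_bounds[1:]):
        if l <= cur_r:              # overlaps (or touches) the open window
            cur_r = max(cur_r, r)
        else:
            merged.append((cur_l, cur_r))
            cur_l, cur_r = l, r
    merged.append((cur_l, cur_r))
    return merged

events = pd.DatetimeIndex(
    ["1970-01-01 00:10:00", "1970-01-01 00:12:00", "1970-01-01 00:40:00"]
)
windows = merge_windows(
    events - pd.Timedelta("4min"), events + pd.Timedelta("1min")
)
# the first two windows overlap, so three events collapse to two slices
```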