
How can I optimise the following operation:

df[(df.start <= x) & (df.end >= y)]

I tried using MultiIndex but saw no significant speedup.

df = df.set_index(['start', 'end'])
df[(df.index.get_level_values('start') <= x) & (df.index.get_level_values('end') >= y)]

Sample data:

                         start                        end
0   2018-11-13 10:28:30.304287 2018-11-13 10:46:28.663868
1   2018-11-13 12:27:32.226550 2018-11-13 13:09:02.723869
2   2018-11-13 13:29:29.981659 2018-11-13 13:54:01.138963
3   2018-11-13 14:30:49.380554 2018-11-13 14:48:50.627830
4   2018-11-13 14:59:26.799017 2018-11-13 15:24:00.453983
5   2018-11-13 16:30:16.824188 2018-11-13 16:48:35.346318
6   2018-11-13 17:15:25.486287 2018-11-13 17:59:30.774629
7   2018-11-13 18:27:41.915379 2018-11-13 18:47:26.528320
8   2018-11-13 19:28:12.835576 2018-11-13 19:52:15.448146
9   2018-11-13 20:41:41.210849 2018-11-13 21:07:52.249831
10  2018-11-13 21:11:23.529623 2018-11-13 21:42:10.106951
Lokesh

1 Answer


The bottleneck is construction of the Boolean series / array used for indexing.

Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

# boundaries for testing
mindt = pd.to_datetime('2016-01-01') 
maxdt = pd.to_datetime('2017-01-01')

x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())

# check results are the same
assert np.array_equal(x, y)

%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop

%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop

Setup

import numpy as np
import pandas as pd

np.random.seed(0)

def random_dates(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9
    cols = ['start', 'end']
    df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') for col in cols})
    df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
    df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
    return df

# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)
jpp
  • Great improvement. But I think the main problem is that both `start` and `end` are unindexed, so the lookup takes `O(n)` time. If somehow we could index both of them it would be a lot faster. What do you think? – Lokesh Nov 23 '18 at 12:29
  • @Lokesh, O(*n*) is unavoidable, you have to iterate every value in both series. Just doing it in NumPy is more efficient. You can get faster too with `numba` if this isn't sufficient (a rough sketch follows these comments). – jpp Nov 23 '18 at 12:30
  • Are you sure `O(n)` is unavoidable? I saw considerable improvement after I moved a column to `DateTimeIndex` and then queried it using `df[date_time_object:]`. Earlier I was doing `df[df.time > date_time_object]` and it was painfully slow. – Lokesh Nov 23 '18 at 12:32
  • @Lokesh, Yes, I'm sure. Think about it theoretically. To compare every value in an array of size *n* to a scalar, you *must* iterate every value once. Don't derive complexity from performance. They aren't the same. [O(1) can be slower than O(*n*) in specific instances too, which is another problem with this logic.] – jpp Nov 23 '18 at 12:35
  • But if we use binary search, the comparison would reduce to `O(log n)`, which is what I suspect `DateTimeIndex` does, but I'm not sure (a sketch of this on sorted data also follows below). – Lokesh Nov 23 '18 at 12:36
  • OK, so what you're suggesting works **only if your inputs are sorted**, which isn't an assumption you've stated. If your inputs aren't sorted, you have to iterate all elements. That's unavoidable. If they are sorted, I suggest you ask a new question stating your special case and what you've tried in terms of optimization. – jpp Nov 23 '18 at 12:37
  • No, my inputs aren't sorted. What I meant was that we can sort them ourselves. – Lokesh Nov 23 '18 at 12:44
  • @Lokesh, Sorting will have complexity greater than O(*n*), usually O(*n* log *n*). But that's another question. – jpp Nov 23 '18 at 12:45
  • Yeah, but I need to make this query like 100,000 times, so in the long run it's not bad. – Lokesh Nov 23 '18 at 12:46
  • @Lokesh, Sure: unfortunately, you haven't mentioned any of these aspects in your original question. So [ask a new question stating your specific requirements](https://stackoverflow.com/questions/ask). – jpp Nov 23 '18 at 12:47
  • I ran an experiment to see whether `DateTimeIndex` uses binary search or not. It turns out it doesn't and takes the same `O(n)`. In that case, there's no point asking a new question. It would be too much work to build a data structure ourselves that would do binary search on dates. I accept this answer. Many thanks! You rock. – Lokesh Nov 23 '18 at 12:53
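
Two of the ideas raised in the comments can be fleshed out a little. First, the `numba` suggestion: a minimal sketch (not part of the original answer; the helper name `overlap_mask` is hypothetical) that fuses both comparisons into a single compiled pass over `int64` views of the datetime columns, so the data is traversed once and no intermediate Boolean arrays are allocated.

import numba
import numpy as np

@numba.njit
def overlap_mask(starts, ends, x, y):
    # starts, ends: int64 views of datetime64[ns] arrays; x, y: int64 scalars (ns since epoch)
    out = np.empty(starts.shape[0], dtype=np.bool_)
    for i in range(starts.shape[0]):
        # single pass: both comparisons evaluated per row, no temporary arrays
        out[i] = starts[i] <= x and ends[i] >= y
    return out

mask = overlap_mask(df['start'].values.view('i8'), df['end'].values.view('i8'),
                    mindt.value, maxdt.value)
result = df[mask]

Second, the binary-search idea. It only applies if the DataFrame is sorted so that *both* `start` and `end` are monotonically increasing (true for the sample data, where the intervals are ordered and non-overlapping, but not something the question guarantees). Under that assumption each condition selects a contiguous block of rows, and `np.searchsorted` finds the block boundaries in O(log n). `query_sorted` below is a hypothetical helper, not pandas API; `x` and `y` are `pd.Timestamp` objects like `mindt` and `maxdt` above.

import numpy as np

def query_sorted(df, x, y):
    # assumes df['start'] and df['end'] are both sorted ascending
    starts = df['start'].values                                    # datetime64[ns]
    ends = df['end'].values
    hi = np.searchsorted(starts, x.to_datetime64(), side='right')  # rows 0..hi-1 satisfy start <= x
    lo = np.searchsorted(ends, y.to_datetime64(), side='left')     # rows lo..n-1 satisfy end >= y
    return df.iloc[lo:hi]                                          # intersection of the two ranges

Paying the O(n log n) cost of sorting once (e.g. `df.sort_values('start').reset_index(drop=True)`) is only worthwhile if, as mentioned in the comments, the same frame is queried many times, and only if the sort really does leave both columns monotonic.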