efficient filtering by date range in pandas

Question

A large dataframe has a date column. By using pandas.read_csv(..., parse_dates=["date"]) to read the data, I assume the column has been converted to an efficient data type for representing dates.

The task is now to select all items that fall into a date range, e.g. ("2018-01-01", "2018-12-31"). This could be extremely fast by having the date column in sorted form and using binary search to locate the bounding indices.

But how do I tell this to pandas? Is it enough to sort by the column and perform a query on it? Should I make it a pandas.DateTimeIndex and use .loc?

One possible caveat is that the items already have a MultiIndex that needs to be kept intact. Also, I don't want more than one copy of the dataframe in memory.

Have you tried with `numpy` and `searchsorted`? Take a look at https://stackoverflow.com/questions/28031346/search-for-elements-by-timestamp-in-a-sorted-pandas-dataframe — Nicolás Ozimica, Mar 18 '19 at 18:15
@Nicolás I will try. But I'm also curious whether it is really necessary to go to numpy. — clstaudt, Mar 18 '19 at 18:26
Use a Boolean Series with `.loc`, checking if `df['date'].between("2018-01-01", "2018-12-31")`, or really just if `df.date.dt.year==2018` — ALollz, Mar 18 '19 at 18:28
@clstaudt Nupy supports Pandas in many ways, so it's not so strange to use `numpy` straight away — Nicolás Ozimica, Mar 18 '19 at 18:28
@Nicolás I agree it is not strange. Yet I am hoping to learn something about pandas DataFrame here: Is there a way to set up the DataFrame so that `df dot get me the items from start_date to end_date` is efficient. Using numpy.searchsorted is doable but not straightforward enough. — clstaudt, Mar 18 '19 at 18:40
No, a Boolen mask is ~O(n). This post has some useful information: https://stackoverflow.com/questions/45240803/pandas-dataframe-search-is-linear-time-or-constant-time — ALollz, Mar 18 '19 at 18:52
@ALollz What you are suggesting is the status quo that I am trying to beat. It's not as fast as it could be. — clstaudt, Mar 18 '19 at 18:55
@ALollz Not so sure that the accepted answer in https://stackoverflow.com/questions/45240803/pandas-dataframe-search-is-linear-time-or-constant-time is useful. Heavy abuse of O-notation and the timing experiments do not support his point, dataframe not big enough probably. — clstaudt, Mar 18 '19 at 19:07

efficient filtering by date range in pandas

0 Answers0