A large dataframe has a date column. By using pandas.read_csv(..., parse_dates=["date"]) to read the data, I assume the column has been converted to an efficient data type for representing dates.
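A minimal sketch of how that assumption could be checked, using a small made-up CSV in place of the real file (the column names and values here are hypothetical):

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
csv_text = "date,value\n2018-03-01,1\n2017-12-31,2\n2019-01-01,3\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# True if the column was parsed into a datetime64 dtype
# rather than being left as strings (object dtype).
is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])
print(df["date"].dtype, is_datetime)
```

If the check comes back False, some values presumably could not be parsed as dates and the column was left as plain strings.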
The task is now to select all items that fall into a date range, e.g. ("2018-01-01", "2018-12-31"). This could be extremely fast by having the date column in sorted form and using binary search to locate the bounding indices.
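To make the idea concrete, here is a rough sketch of what I have in mind, using Series.searchsorted on made-up data. I don't know whether this is the recommended way, and note that sort_values returns a new dataframe, so it does not by itself address the memory concern:

```python
import pandas as pd

# Hypothetical toy data; the real dataframe is much larger.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-01-01", "2018-03-01", "2017-12-31", "2018-12-31", "2018-01-01"]
    ),
    "value": [1, 2, 3, 4, 5],
})

# searchsorted assumes the column is already sorted in ascending order.
# The existing index is carried along unchanged by sort_values.
df = df.sort_values("date")

start = pd.Timestamp("2018-01-01")
# Midnight of the last day; rows later on that day would be excluded.
end = pd.Timestamp("2018-12-31")

# Binary search for the bounding positions of the range [start, end].
lo = df["date"].searchsorted(start, side="left")
hi = df["date"].searchsorted(end, side="right")

# Positional slice between the two bounds.
selected = df.iloc[lo:hi]
print(selected)
```

Since iloc slices by position, this should not depend on what the index looks like.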
But how do I tell this to pandas? Is it enough to sort by the column and perform a query on it? Should I make it a pandas.DatetimeIndex and use .loc?
One possible caveat is that the items already have a MultiIndex that needs to be kept intact. Also, I don't want more than one copy of the dataframe in memory.