2

A large dataframe has a date column. By using pandas.read_csv(..., parse_dates=["date"]) to read the data, I assume the column has been converted to an efficient data type for representing dates.

The task is now to select all items that fall into a date range, e.g. ("2018-01-01", "2018-12-31"). This could be extremely fast by having the date column in sorted form and using binary search to locate the bounding indices.

But how do I tell this to pandas? Is it enough to sort by the column and perform a query on it? Should I make it a pandas.DateTimeIndex and use .loc?

One possible caveat is that the items already have a MultiIndex that needs to be kept intact. Also, I don't want more than one copy of the dataframe in memory.

clstaudt
  • 21,436
  • 45
  • 156
  • 239
  • 1
    Have you tried with `numpy` and `searchsorted`? Take a look at https://stackoverflow.com/questions/28031346/search-for-elements-by-timestamp-in-a-sorted-pandas-dataframe – Nicolás Ozimica Mar 18 '19 at 18:15
  • @Nicolás I will try. But I'm also curious whether it is really necessary to go to numpy. – clstaudt Mar 18 '19 at 18:26
  • 2
    Use a Boolean Series with `.loc`, checking if `df['date'].between("2018-01-01", "2018-12-31")`, or really just if `df.date.dt.year==2018` – ALollz Mar 18 '19 at 18:28
  • @clstaudt Nupy supports Pandas in many ways, so it's not so strange to use `numpy` straight away – Nicolás Ozimica Mar 18 '19 at 18:28
  • @Nicolás I agree it is not strange. Yet I am hoping to learn something about pandas DataFrame here: Is there a way to set up the DataFrame so that `df dot get me the items from start_date to end_date` is efficient. Using numpy.searchsorted is doable but not straightforward enough. – clstaudt Mar 18 '19 at 18:40
  • @ALollz is that efficient, i.e. O(log(n)) and not O(n)? – clstaudt Mar 18 '19 at 18:41
  • 1
    No, a Boolen mask is ~O(n). This post has some useful information: https://stackoverflow.com/questions/45240803/pandas-dataframe-search-is-linear-time-or-constant-time – ALollz Mar 18 '19 at 18:52
  • @ALollz What you are suggesting is the status quo that I am trying to beat. It's not as fast as it could be. – clstaudt Mar 18 '19 at 18:55
  • @ALollz Not so sure that the accepted answer in https://stackoverflow.com/questions/45240803/pandas-dataframe-search-is-linear-time-or-constant-time is useful. Heavy abuse of O-notation and the timing experiments do not support his point, dataframe not big enough probably. – clstaudt Mar 18 '19 at 19:07

0 Answers0