3

Asumme the following toy code:

import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=72, freq='H')
avec = np.random.rand(len(rng))
bvec = np.random.rand(len(rng))
df = pd.DataFrame({"A":avec,"B":bvec}, index=rng)

I can now select a part of the time interval with

df.loc["2011-01-02",:]

Is there a way to efficiently access the boolean mask that corresponds to the resulting slice, i.e:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)

I have tried the suggestions in an earlier stackoverflow answer but df.index.date takes a very long time to run on my dataset...

jpp
  • 159,742
  • 34
  • 281
  • 339
brorfred
  • 492
  • 1
  • 4
  • 14

3 Answers3

8

If performance is important chain 2 boolean masks:

(df.index >= "2011-01-02") & (df.index < "2011-01-03")
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

IIUC, you can do this:

df.index.isin(df.loc["2011-01-02",:].index)

array([False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False], dtype=bool)

However, @jezrael's answer is faster. I leave this up as an alternative solution. The only convenience that this provides is if you want to refer to your sliced dataframe by name, rather than by the date string, for instance:

# named slice of your original dataframe:
sliced_df = df.loc["2011-01-02",:]
# get boolean array:
df.index.isin(sliced_df.index)
sacuL
  • 49,704
  • 8
  • 81
  • 106
0

You can extract numpy representation of your index and compare with a np.datetime64 object:

import numpy as np
from datetime import datetime

(df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
(df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))

Note on behaviour

The solution above is specific to the query in the question. As @Jeff points out, string representations of datetime use partial indexing. So using numpy should only be used in specific cases.

See pandas documentation on datetime indexing for more details.

Performance benchmarking

df = pd.concat([df]*1000)

%timeit (df.index >= "2011-01-02") & (df.index < "2011-01-03")

%timeit (df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
        (df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))

assert ((df.index >= "2011-01-02") & (df.index < "2011-01-03") == \
        (df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
        (df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))).all()

# 1.21 ms ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 527 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 3
    these are not the same generally ; string datetimes use partial indexing. offering up a numpy soln is often a special case and not recommended – Jeff Apr 08 '18 at 16:06
  • 2
    docs: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#indexing – Jeff Apr 08 '18 at 16:08
  • 1
    Thanks for the solution! It's definitely faster than jezreal's but the difference decrease a little bit on large datasets and I'm a bit worried about memory consumption. I wish there was a way to get the equivalent to df.index == "2011-01-02" with partial indexing... – brorfred Apr 08 '18 at 16:45