Boolean mask from pandas datetime index using .loc accessor

Question

Asumme the following toy code:

import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=72, freq='H')
avec = np.random.rand(len(rng))
bvec = np.random.rand(len(rng))
df = pd.DataFrame({"A":avec,"B":bvec}, index=rng)

I can now select a part of the time interval with

df.loc["2011-01-02",:]

Is there a way to efficiently access the boolean mask that corresponds to the resulting slice, i.e:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)

I have tried the suggestions in an earlier stackoverflow answer but df.index.date takes a very long time to run on my dataset...

score 8 · Accepted Answer · answered Apr 08 '18 at 15:44

8

If performance is important chain 2 boolean masks:

(df.index >= "2011-01-02") & (df.index < "2011-01-03")

answered Apr 08 '18 at 15:44

jezrael

822,522
95
1,334
1,252

I'm curious on why you say "if performance is important"? Do you think the syntax is unclear and could be better if performance wasn't important? – roganjosh Apr 08 '18 at 15:46
@roganjosh - I think compare solution with `isin`, but I have to test it for 100% sure :) – jezrael Apr 08 '18 at 15:46
It is indeed faster, I just timed it. – sacuL Apr 08 '18 at 15:47
Ah, I prefer your solution as it is! :) – roganjosh Apr 08 '18 at 15:47

sacuL · Answer 2 · 2018-04-08T16:28:39.913

IIUC, you can do this:

df.index.isin(df.loc["2011-01-02",:].index)

array([False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False], dtype=bool)

However, @jezrael's answer is faster. I leave this up as an alternative solution. The only convenience that this provides is if you want to refer to your sliced dataframe by name, rather than by the date string, for instance:

# named slice of your original dataframe:
sliced_df = df.loc["2011-01-02",:]
# get boolean array:
df.index.isin(sliced_df.index)

jpp · Answer 3 · 2018-04-08T16:32:49.080

You can extract numpy representation of your index and compare with a np.datetime64 object:

import numpy as np
from datetime import datetime

(df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
(df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))

Note on behaviour

The solution above is specific to the query in the question. As @Jeff points out, string representations of datetime use partial indexing. So using numpy should only be used in specific cases.

See pandas documentation on datetime indexing for more details.

Performance benchmarking

df = pd.concat([df]*1000)

%timeit (df.index >= "2011-01-02") & (df.index < "2011-01-03")

%timeit (df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
        (df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))

assert ((df.index >= "2011-01-02") & (df.index < "2011-01-03") == \
        (df.index.values >= np.datetime64(datetime.strptime("2011-01-02", '%Y-%m-%d'))) & \
        (df.index.values < np.datetime64(datetime.strptime("2011-01-03", '%Y-%m-%d')))).all()

# 1.21 ms ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 527 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

these are not the same generally ; string datetimes use partial indexing. offering up a numpy soln is often a special case and not recommended — Jeff, Apr 08 '18 at 16:06
docs: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#indexing — Jeff, Apr 08 '18 at 16:08
Thanks for the solution! It's definitely faster than jezreal's but the difference decrease a little bit on large datasets and I'm a bit worried about memory consumption. I wish there was a way to get the equivalent to df.index == "2011-01-02" with partial indexing... — brorfred, Apr 08 '18 at 16:45

Boolean mask from pandas datetime index using .loc accessor

3 Answers3

Linked

Related