Python Pandas: drop rows of a timeserie based on time range

Question

I have the following timeserie:

start = pd.to_datetime('2016-1-1')
end = pd.to_datetime('2016-1-15')
rng = pd.date_range(start, end, freq='2h')
df = pd.DataFrame({'timestamp': rng, 'values': np.random.randint(0,100,len(rng))})  
df = df.set_index(['timestamp'])

I would like to drop the rows that are between those 2 timestamps:

start_remove = pd.to_datetime('2016-1-4')
end_remove = pd.to_datetime('2016-1-8')

How can I do that?

piRSquared · Accepted Answer · 2017-01-06T19:59:16.203

27

using query

df.query('index < @start_remove or index > @end_remove')

using loc

df.loc[(df.index < start_remove) | (df.index > end_remove)]

using date slicing

This includes the end points

pd.concat([df[:start_remove], df[end_remove:]])

And without the end points

pd.concat([df[:start_remove], df[end_remove:]]).drop([start_remove, end_remove])

edited Jan 06 '17 at 19:59

answered Jan 06 '17 at 19:56

piRSquared

285,575
57
475
624

Thank you that's good. Do you think there is a short syntax to do that with boolean indexing as well? – jim jarnac Jan 06 '17 at 19:59
Outstanding, learning so much! actually query and loc are better since it is easier to decide whether to include or not start and end point. What does the pipe symbol stands for? (i could google it but its very difficult to get good answers on single symbol question) – jim jarnac Jan 06 '17 at 20:03
1

`|` is the binary or operator that pandas hijacks to do element wise logical operations. – piRSquared Jan 06 '17 at 20:04
Ok so it basically is the same as `&` ? Or am i getting it wrong? – jim jarnac Jan 06 '17 at 20:07
1

Yes, except it's "or" not "and". – piRSquared Jan 06 '17 at 20:08

score 6 · Answer 2 · answered Jan 11 '19 at 14:05

6

df = df.drop(pd.date_range('2018-01-01', '2018-02-01')), errors='ignore')

answered Jan 11 '19 at 14:05

Joe Heffer

379
5
6

nipy · Answer 3 · 2017-01-06T21:31:02.773

4

Another one to try. Exclude the dates in the date_range:

Edit: Added frequency to date_range. This is now the same as original data.

dropThis = pd.date_range(start_remove,end_remove,freq='2h')
df[~df.index.isin(dropThis)]

We can see the rows are now dropped.

len(df)
169

len(df[~pd.to_datetime(df.index).isin(dropThis)])
120

edited Jan 06 '17 at 21:31

answered Jan 06 '17 at 20:42

nipy

5,138
5
31
72

Interesting; what does the symbol `~` stands for? – jim jarnac Jan 06 '17 at 20:43
Please see this post: http://stackoverflow.com/questions/8305199/the-tilde-operator-in-python – nipy Jan 06 '17 at 20:47
1

Please try again now. I added a frequency string to the date range. – nipy Jan 06 '17 at 21:31

EdChum · Answer 4 · 2017-01-06T23:22:51.403

An obscure method is to use slice_indexer on your index by passing your start and end range, this will return a Slice object which you can use to index into your original index and then negate the values using isin:

In [20]:
df.loc[~df.index.isin(df.index[df.index.slice_indexer(start_remove, end_remove)])]

Out[20]:
                     values
timestamp                  
2016-01-01 00:00:00       0
2016-01-01 02:00:00      57
2016-01-01 04:00:00      98
2016-01-01 06:00:00      82
2016-01-01 08:00:00      24
2016-01-01 10:00:00       1
2016-01-01 12:00:00      41
2016-01-01 14:00:00      14
2016-01-01 16:00:00      40
2016-01-01 18:00:00      48
2016-01-01 20:00:00      77
2016-01-01 22:00:00      34
2016-01-02 00:00:00      88
2016-01-02 02:00:00      58
2016-01-02 04:00:00      72
2016-01-02 06:00:00      24
2016-01-02 08:00:00      32
2016-01-02 10:00:00      44
2016-01-02 12:00:00      57
2016-01-02 14:00:00      88
2016-01-02 16:00:00      97
2016-01-02 18:00:00      75
2016-01-02 20:00:00      46
2016-01-02 22:00:00      31
2016-01-03 00:00:00      60
2016-01-03 02:00:00      73
2016-01-03 04:00:00      79
2016-01-03 06:00:00      71
2016-01-03 08:00:00      53
2016-01-03 10:00:00      70
...                     ...
2016-01-12 14:00:00       5
2016-01-12 16:00:00      42
2016-01-12 18:00:00      17
2016-01-12 20:00:00      94
2016-01-12 22:00:00      63
2016-01-13 00:00:00      63
2016-01-13 02:00:00      50
2016-01-13 04:00:00      44
2016-01-13 06:00:00      35
2016-01-13 08:00:00      59
2016-01-13 10:00:00      53
2016-01-13 12:00:00      16
2016-01-13 14:00:00      68
2016-01-13 16:00:00      66
2016-01-13 18:00:00      56
2016-01-13 20:00:00      18
2016-01-13 22:00:00      59
2016-01-14 00:00:00       8
2016-01-14 02:00:00      60
2016-01-14 04:00:00      52
2016-01-14 06:00:00      87
2016-01-14 08:00:00      31
2016-01-14 10:00:00      91
2016-01-14 12:00:00      64
2016-01-14 14:00:00      53
2016-01-14 16:00:00      47
2016-01-14 18:00:00      87
2016-01-14 20:00:00      47
2016-01-14 22:00:00      27
2016-01-15 00:00:00      28

[120 rows x 1 columns]

Here you can see that 49 rows were removed from the original df

In [23]:
df.index.slice_indexer(start_remove, end_remove)

Out[23]:
slice(36, 85, None)

In [24]:
df.index.isin(df.index[df.index.slice_indexer(start_remove, end_remove)])

Out[24]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False, False, False, False,
       ........
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False], dtype=bool)

and then invert the above using ~

Edit Actually you can achieve this without isin:

df.loc[df.index.difference(df.index[df.index.slice_indexer(start_remove, end_remove)])]

will also work.

Timings

Interestingly this is also the fastest method:

In [30]:
%timeit df.loc[df.index.difference(df.index[df.index.slice_indexer(start_remove, end_remove)])]

100 loops, best of 3: 4.05 ms per loop

In [31]:    
%timeit df.query('index < @start_remove or index > @end_remove')

10 loops, best of 3: 15.2 ms per loop

In [32]:    
%timeit df.loc[(df.index < start_remove) | (df.index > end_remove)]

100 loops, best of 3: 4.94 ms per loop

@jimbasquiat yes it uses `isin` but the difference here is that we don't generate any new index we query the existing index with the start and stop range which returns a `Slice` object which is just something we can use to index back so it would be more efficient than Adele's answer for large datasets but it's not terribly readable compared to PiRSquared's answer — EdChum, Jan 06 '17 at 21:36
Please note any solution by EdChum or @piRSquared is to be trusted more than my novice attempts :-) — nipy, Jan 06 '17 at 21:37
@adele I think the main problem with creating a `date_range` is that it supposes that the index freq is consistent and that there are no missing values which would not be an issue for piRSquared or my answer but it works fine here, still +1 though — EdChum, Jan 06 '17 at 21:39

Python Pandas: drop rows of a timeserie based on time range

4 Answers4

Linked