18

I have the following timeserie:

start = pd.to_datetime('2016-1-1')
end = pd.to_datetime('2016-1-15')
rng = pd.date_range(start, end, freq='2h')
df = pd.DataFrame({'timestamp': rng, 'values': np.random.randint(0,100,len(rng))})  
df = df.set_index(['timestamp'])

I would like to drop the rows that are between those 2 timestamps:

start_remove = pd.to_datetime('2016-1-4')
end_remove = pd.to_datetime('2016-1-8')

How can I do that?

jim jarnac
  • 4,804
  • 11
  • 51
  • 88

4 Answers4

27

using query

df.query('index < @start_remove or index > @end_remove')

using loc

df.loc[(df.index < start_remove) | (df.index > end_remove)]

using date slicing

This includes the end points

pd.concat([df[:start_remove], df[end_remove:]])

And without the end points

pd.concat([df[:start_remove], df[end_remove:]]).drop([start_remove, end_remove])
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thank you that's good. Do you think there is a short syntax to do that with boolean indexing as well? – jim jarnac Jan 06 '17 at 19:59
  • Outstanding, learning so much! actually query and loc are better since it is easier to decide whether to include or not start and end point. What does the pipe symbol stands for? (i could google it but its very difficult to get good answers on single symbol question) – jim jarnac Jan 06 '17 at 20:03
  • 1
    `|` is the binary or operator that pandas hijacks to do element wise logical operations. – piRSquared Jan 06 '17 at 20:04
  • Ok so it basically is the same as `&` ? Or am i getting it wrong? – jim jarnac Jan 06 '17 at 20:07
  • 1
    Yes, except it's "or" not "and". – piRSquared Jan 06 '17 at 20:08
6
df = df.drop(pd.date_range('2018-01-01', '2018-02-01')), errors='ignore')
Joe Heffer
  • 379
  • 5
  • 6
4

Another one to try. Exclude the dates in the date_range:

Edit: Added frequency to date_range. This is now the same as original data.

dropThis = pd.date_range(start_remove,end_remove,freq='2h')
df[~df.index.isin(dropThis)]

We can see the rows are now dropped.

len(df)
169

len(df[~pd.to_datetime(df.index).isin(dropThis)])
120
nipy
  • 5,138
  • 5
  • 31
  • 72
3

An obscure method is to use slice_indexer on your index by passing your start and end range, this will return a Slice object which you can use to index into your original index and then negate the values using isin:

In [20]:
df.loc[~df.index.isin(df.index[df.index.slice_indexer(start_remove, end_remove)])]

Out[20]:
                     values
timestamp                  
2016-01-01 00:00:00       0
2016-01-01 02:00:00      57
2016-01-01 04:00:00      98
2016-01-01 06:00:00      82
2016-01-01 08:00:00      24
2016-01-01 10:00:00       1
2016-01-01 12:00:00      41
2016-01-01 14:00:00      14
2016-01-01 16:00:00      40
2016-01-01 18:00:00      48
2016-01-01 20:00:00      77
2016-01-01 22:00:00      34
2016-01-02 00:00:00      88
2016-01-02 02:00:00      58
2016-01-02 04:00:00      72
2016-01-02 06:00:00      24
2016-01-02 08:00:00      32
2016-01-02 10:00:00      44
2016-01-02 12:00:00      57
2016-01-02 14:00:00      88
2016-01-02 16:00:00      97
2016-01-02 18:00:00      75
2016-01-02 20:00:00      46
2016-01-02 22:00:00      31
2016-01-03 00:00:00      60
2016-01-03 02:00:00      73
2016-01-03 04:00:00      79
2016-01-03 06:00:00      71
2016-01-03 08:00:00      53
2016-01-03 10:00:00      70
...                     ...
2016-01-12 14:00:00       5
2016-01-12 16:00:00      42
2016-01-12 18:00:00      17
2016-01-12 20:00:00      94
2016-01-12 22:00:00      63
2016-01-13 00:00:00      63
2016-01-13 02:00:00      50
2016-01-13 04:00:00      44
2016-01-13 06:00:00      35
2016-01-13 08:00:00      59
2016-01-13 10:00:00      53
2016-01-13 12:00:00      16
2016-01-13 14:00:00      68
2016-01-13 16:00:00      66
2016-01-13 18:00:00      56
2016-01-13 20:00:00      18
2016-01-13 22:00:00      59
2016-01-14 00:00:00       8
2016-01-14 02:00:00      60
2016-01-14 04:00:00      52
2016-01-14 06:00:00      87
2016-01-14 08:00:00      31
2016-01-14 10:00:00      91
2016-01-14 12:00:00      64
2016-01-14 14:00:00      53
2016-01-14 16:00:00      47
2016-01-14 18:00:00      87
2016-01-14 20:00:00      47
2016-01-14 22:00:00      27
2016-01-15 00:00:00      28

[120 rows x 1 columns]

Here you can see that 49 rows were removed from the original df

In [23]:
df.index.slice_indexer(start_remove, end_remove)

Out[23]:
slice(36, 85, None)

In [24]:
df.index.isin(df.index[df.index.slice_indexer(start_remove, end_remove)])

Out[24]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False, False, False, False,
       ........
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False], dtype=bool)

and then invert the above using ~

Edit Actually you can achieve this without isin:

df.loc[df.index.difference(df.index[df.index.slice_indexer(start_remove, end_remove)])]

will also work.

Timings

Interestingly this is also the fastest method:

In [30]:
%timeit df.loc[df.index.difference(df.index[df.index.slice_indexer(start_remove, end_remove)])]

100 loops, best of 3: 4.05 ms per loop

In [31]:    
%timeit df.query('index < @start_remove or index > @end_remove')

10 loops, best of 3: 15.2 ms per loop

In [32]:    
%timeit df.loc[(df.index < start_remove) | (df.index > end_remove)]

100 loops, best of 3: 4.94 ms per loop
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • @jimbasquiat yes it uses `isin` but the difference here is that we don't generate any new index we query the existing index with the start and stop range which returns a `Slice` object which is just something we can use to index back so it would be more efficient than Adele's answer for large datasets but it's not terribly readable compared to PiRSquared's answer – EdChum Jan 06 '17 at 21:36
  • Please note any solution by EdChum or @piRSquared is to be trusted more than my novice attempts :-) – nipy Jan 06 '17 at 21:37
  • 1
    @adele I think the main problem with creating a `date_range` is that it supposes that the index freq is consistent and that there are no missing values which would not be an issue for piRSquared or my answer but it works fine here, still +1 though – EdChum Jan 06 '17 at 21:39