
I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.

What is the best way to achieve this?

asked by AMM, edited by Catija

16 Answers


If the date column is the index, then use .loc for label-based indexing or .iloc for positional indexing.

For example:

df.loc['2014-01-01':'2014-02-01']

See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

If the column is not the index you have two choices:

  1. Make it the index (either temporarily or permanently if it's time-series data)
  2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]

See here for the general explanation

Note: .ix is deprecated.
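Applied to the original question (keep only rows within the next two months), the second, boolean-mask approach might look like this; the data and the reference date are made up for illustration:

```python
import pandas as pd

# Toy frame; in the question, 'date' is an ordinary column, not the index.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-15", "2014-02-20", "2014-06-01"]),
    "value": [1, 2, 3],
})

today = pd.Timestamp("2014-01-10")        # in practice: pd.Timestamp.now().normalize()
cutoff = today + pd.DateOffset(months=2)  # calendar-accurate "two months"

# Keep only rows falling within the next two months.
within_two_months = df[(df["date"] >= today) & (df["date"] < cutoff)]
```

pd.DateOffset(months=2) respects calendar month lengths, unlike a fixed timedelta of 60 days.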

– Retozi (edited by Foreever)

  • Thank you, will read. The date is a separate column and not the index in my case. I should have probably given that information in the first place. My question was not very informative. – AMM Apr 06 '14 at 19:35
  • updated my answer to account for date column filtering – Retozi Apr 06 '14 at 19:45
  • You can use `query` here as well. `df.query('20130101 < date < 20130201')`. – Phillip Cloud Apr 06 '14 at 19:56
  • Same as: http://stackoverflow.com/questions/16341367/grabbing-selection-between-specific-dates-in-a-dataframe Which is also useful. – Union find Dec 31 '14 at 20:03
  • You should mention that the filters for index (via `.loc` and `.ix`) and columns in your examples are not equivalent. `df.ix['2014-01-01':'2014-02-01']` includes `2014-02-01` while `df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]` does not include `2013-02-01`, it will only match rows up to `2013-01-31`. – Rafael Barbosa Jul 26 '16 at 12:57
  • What if one doesn't want to filter on a date range, but on multiple datetimes? – Salem Jul 06 '18 at 08:42
  • Will this correctly compare a datetime object against a string? – Addem Aug 19 '18 at 05:13
  • @Phillip Cloud I would like to use your solution, but `df.query('20130101 < time < 20130201')` and `df.query('2013-01-01 < time < 2013-02-01')` throw `TypeError: Cannot compare type 'Period' with type 'int'` (or string). Tried `df.query(pd.to_datetime('2013-01-01') < 'time' < pd.to_datetime('2013-02-01'))` but got `TypeError: Cannot compare type 'Timestamp' with type 'str'`. My df['time'] is a `to_datetime` object, formatted as `'%Y%m%d'`. Any suggestion? – user305883 Mar 16 '19 at 11:50
  • Figured it out: my error was because I stripped the date object to keep only the date part, without time, so it was no longer a to_datetime object. That is, my df['time'] was constructed as `pd.to_datetime(..).dt.date` instead of simply `pd.to_datetime(..)`. – user305883 Mar 16 '19 at 12:02
  • I get this: TypeError: '<' not supported between instances of 'datetime.date' and 'str' – MarMat Jul 01 '20 at 12:25
  • How can I set the filter more precise; such as from '2013-01-01 16:53:22:' onwards? – Baobab Mar 09 '22 at 14:11
  • Please be aware that by using loc it means you are expecting the exact min and max dates to exists as values. This solution won't work if you're filtering by dates that might not exist exactly in the df – Crispy Holiday May 11 '22 at 02:55
  • Make sure the index is sorted (`sort_index(inplace=True)`) otherwise you may get `KeyError: 'Value based partial slicing on non-monotonic DatetimeIndexes with non-existing keys is not allowed.'` – AXO Aug 26 '23 at 02:47

The previous answer was not correct in my experience: you can't pass it a simple string, it needs to be a datetime object. So:

import datetime
df.loc[datetime.date(year=2014, month=1, day=1):datetime.date(year=2014, month=2, day=1)]
– orange1 (edited by wordsforthewise)

If your dates are datetime.date objects (via the datetime package), you can simply use:

df[(df['date'] > datetime.date(2016, 1, 1)) & (df['date'] < datetime.date(2016, 3, 1))]

To standardize a date string with the datetime package, you can use this function:

import datetime
datetime.datetime.strptime
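For completeness, a minimal sketch of strptime with an explicit format string (the date value is just an example):

```python
import datetime

# strptime parses a string into a datetime object using an explicit format.
d = datetime.datetime.strptime("2016-01-05", "%Y-%m-%d")
```

d.date() then gives the corresponding datetime.date for comparisons like the one above.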
– shm2008

If you have already converted the strings to a date format using pd.to_datetime, you can just use:

df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
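A minimal end-to-end sketch of that conversion plus the filter, with made-up data and the column name 'Date' as in the answer:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2017-12-31", "2018-06-15", "2019-08-01"]})
df["Date"] = pd.to_datetime(df["Date"])  # strings -> datetime64[ns]

# String bounds are coerced to timestamps for the comparison.
df = df[(df["Date"] > "2018-01-01") & (df["Date"] < "2019-07-01")]
```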

– Jerin Mathew (edited by ah bon)

The shortest way to filter your dataframe by date: let's suppose your date column is of type datetime64[ns].

# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
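The same filters can also be written with the .dt accessor attributes, which compare numbers instead of building strings for every row; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2014-01-01", "2014-03-15", "2015-01-01"])})

# Equivalent year/month filters via the .dt accessor (no string conversion):
by_year = df[df["date"].dt.year == 2014]
by_month = df[(df["date"].dt.year == 2014) & (df["date"].dt.month == 1)]
```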
– Ekrem Gurdal

If your datetime column has the Pandas datetime type (e.g. datetime64[ns]), then for proper filtering you need a pd.Timestamp object, for example:

from datetime import date

import pandas as pd

value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
– VMAtm

If the dates are in the index then simply:

df['20160101':'20160301']

Edit: while short, this style is now deprecated (at least as of pandas 1.5.3) and the recommended style is df.loc['20160101':'20160301'] (as in other answers)

– fantabolous

You can use pd.Timestamp to perform a query, using a local reference:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame()
ts = pd.Timestamp

df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')

print(df)
print(df.query('date > @ts("20190515T071320")'))

with the output

                 date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25


                 date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25

Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp through the local alias ts, so that we can supply a timestamp string.

– danielhrisca
  • Could you pass a link for documentation for @ts functions? – Glen Moutrie May 26 '20 at 11:45
  • You may not need `pd.Timestamp` here. `df.query('date > 20190515071320')` seems to work fine. – ChaimG Oct 26 '21 at 13:54
  • Yes, @ChaimG, it works fine; `pd.Timestamp` is not needed. Pandas `query()` is very flexible: it accepts integers `df.query('date > 20190515071320')`, strings `df.query('date > "20190515071320"')`, or, if you want it easy to read, `df.query('date > "2019-05-15 07:13:20"')`. – A.Sommerh Nov 06 '22 at 21:07
  • With "T"? Also OK: `df.query('date > "20190515T071320"')`, `df.query('date > "2022-11-06 T 20:32:00"')`. – A.Sommerh Nov 06 '22 at 21:13
  • Only the date, without the time part? Fine: `df.query('date > "20190515"')`, `df.query('date > "2019-05-15"')`, or even the integer `df.query('date > 20190515')`, which is not confused with an integer date+time. Smart enough! – A.Sommerh Nov 06 '22 at 21:16

I'm not allowed to write comments yet, so I'll write an answer, in case somebody reads all of them and reaches this one.

If the index of the dataset is a datetime and you want to filter just by (for example) month, you can do the following:

df.loc[df.index.month == 3]

That will filter the dataset for you by March.
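A small sketch of this with a synthetic monthly index; the year can be checked the same way:

```python
import pandas as pd

idx = pd.date_range("2014-01-01", "2015-12-31", freq="MS")  # month starts
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

march = df.loc[df.index.month == 3]                                   # March of every year
march_2014 = df.loc[(df.index.month == 3) & (df.index.year == 2014)]  # March 2014 only
```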

– uhetz

When loading the CSV data file, we need to set the date column as the index, as below, in order to filter the data by a range of dates. (This was not needed with the now-deprecated pd.DataFrame.from_csv().)

If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:

import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True)  # parse_dates makes the index a DatetimeIndex
mydata['2020-01-01':'2020-02-29'] # will pull all the columns
# if just one column is needed, e.g. Cost:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']

This has been tested working for Python 3.7. Hope you will find this useful.

– Harry
  • `index_col` has to be a `string`, not a list: `mydata = pd.read_csv('mydata.csv', index_col='date')` – ANUBIS Apr 16 '20 at 14:13

You could just select the time range by doing: df.loc['start_date':'end_date']

– petezurich
STEP 1: convert the date column into a pandas datetime using pd.to_datetime()

import pandas as pd

df['date'] = pd.to_datetime(df["date"], unit='s')

STEP 2: perform the filtering in any predetermined manner (i.e. 2 months)

df = df[(df["date"] > "2022-03-01") & (df["date"] < "2022-05-03")]
– Martin Maati (edited by fantabolous)

How about using pyjanitor?

It has cool features. After pip install pyjanitor:

import janitor

df_filtered = df.filter_date("your_date_column_name", start_date, end_date)

(The column name is passed as a string.)
– pakira79 (edited by aminography)
  • ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – mah65 Apr 13 '21 at 09:27

In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case

df.loc['2021-08-01':'2021-08-31']

returned empty. Whereas

df.loc['2021-08-31':'2021-08-01']

returned the expected data.

– Paul

Another solution, if you would like to use the .query() method.

It lets you write readable code like .query(f"{start} < MyDate < {end}"), with the trade-off that .query() parses strings, and the column values must be in a pandas date format (so that .query() understands them too).

import datetime

import pandas as pd

df = pd.DataFrame({
    'MyValue': [1, 2, 3],
    'MyDate': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])
})
start = datetime.date(2021, 1, 1).strftime('%Y%m%d')
end = datetime.date(2021, 1, 3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")

(following the comment from @Phillip Cloud, answer from @Retozi)

– Marcel Flygare
import datetime
import pandas as pd

# 60 days from today
after_60d = pd.to_datetime('today').date() + datetime.timedelta(days=60)
# keep rows whose date column is before that date
df[df['date_col'] < after_60d]
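For a calendar-accurate "two months" instead of a fixed 60 days, pd.DateOffset can be used; a sketch with made-up data, keeping the answer's column name date_col:

```python
import pandas as pd

df = pd.DataFrame({"date_col": pd.to_datetime(["2024-01-10", "2024-02-15", "2024-09-01"])})

today = pd.Timestamp("2024-01-01")  # in practice: pd.Timestamp.now().normalize()

# Rows from today up to (but not including) the same day two calendar months later.
in_two_months = df[(df["date_col"] >= today) &
                   (df["date_col"] < today + pd.DateOffset(months=2))]
```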
– DSBLR