0

I am having problems filtering a pandas dataframe - I want to filter all the dates in the "Date" column to return only dates in the last month, but python is mixing up the date and month and returning the wrong answer. Could you let me know how to fix thanks

import pandas as pd
import numpy as np
import datetime as dt
df = pd.DataFrame(pd.read_csv("Dates.csv"))
tday = dt.date.today()
tdelta = dt.timedelta(days=-30)
dt = tday + tdelta
dt = np.datetime64(dt)
print(dt)
df['Date'] = pd.to_datetime(df['Date'])
df = df.loc[df['Date'] >= dt]
print(df)

Dates.csv df is:

DF = 
     Met By        Date
0    David   06/07/2020
1    Philip  22/06/2020
2   Richard  02/04/2020
3   Richard  09/03/2020
4   Richard  17/02/2020
5    Philip  12/02/2020
6    Philip  29/01/2020
7      Jane  20/01/2020
8      Jane  13/01/2020
9   Richard  10/01/2020
10   Philip  06/01/2020

UPDATE: solved by VALDI - must include dayfirst = True on read_csv

import pandas as pd
df = pd.read_csv("Dates.csv", parse_dates=[1], dayfirst = True)
df = df[df['Date'] > pd.Timestamp.today() - pd.Timedelta('30D')]
print(df)
David A
  • 141
  • 2
  • 10
  • You can refer this answer https://stackoverflow.com/questions/47642201/how-can-i-sort-dataframe-by-date-in-python – Venkat Ramana Aug 28 '20 at 16:29
  • 3
    Does this answer your question? [How can I sort DataFrame by date in Python?](https://stackoverflow.com/questions/47642201/how-can-i-sort-dataframe-by-date-in-python) – Filip Aug 28 '20 at 18:40

2 Answers2

2

Let's start from the way how you read your DataFrame:

df = pd.DataFrame(pd.read_csv("Dates.csv"))

Note that:

  • pd.read_csv already returns a DataFrame,
  • so there is no need to create another DataFrame from the first one.

A simpler approach is: df = pd.read_csv("Dates.csv").

But this is not all. If you have a column containing a date then convert it to datetime type as early as when your read the DateFrame, so, assuming that your file contains only Met By and Date columns (no index column), the proper formula to read is:

df = pd.read_csv("Dates.csv", parse_dates=[1])

And now how to filter your DataFrame:

The first hint is not to use datetime module, as Pandas has its native today and Timedelta functions. As Date column is now of proper (datetime) type, you don't need any conversions. Just use:

df[df.Date > pd.Timestamp.today() - pd.Timedelta('30D')]

If you have also future dates and want to filter them out, run:

df[df.Date > (pd.Timestamp.today() - pd.Timedelta('30D'))
    and df.Date < pd.Timestamp.today()]
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • hi i am using your code: df = pd.read_csv("Dates.csv", parse_dates=[1]) df = df[df['Instruction Date'] > pd.Timestamp.today() - pd.Timedelta('30D')] - but it is still returning "3 Richard 2020-09-03 5 Philip 2020-12-02 9 Richard 2020-10-01" as values - when it should not return any. lmk if there is a way to fix this thanks – David A Aug 28 '20 at 20:00
  • also if i add this i get same problem - df['Instruction Date'] = pd.to_datetime(df['Instruction Date'], format='%Y/%m/%d') – David A Aug 28 '20 at 20:02
  • Your data sample contains only **past** dates, so I didn't think about **future** dates. Now I corrected this detail. – Valdi_Bo Aug 28 '20 at 20:16
  • Hi Valdi - this is not a future date - it is returning me "3 Richard 09/03/2020" - the third row in the data sample - and it is swapping the date with the month - which means it is not returning the right responses. Do you know how to fix this? Thanks – David A Aug 28 '20 at 20:31
  • Pass also *dayfirst=True* to *read_csv*. As I remember, the default value is *False*, so day / month are swapped. – Valdi_Bo Aug 28 '20 at 20:34
0

You can use pandas's to_timedelta function and datetime's datetime.now function on date column like

It seems you need to reformat date first

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')     

then

import datetime
import pandas as pd 


df_filtered = df[df.Date > datetime.datetime.now() - pd.to_timedelta("30day")]

As suggested by @Henry, you can use pd.Timestamp.now() as well

df_filtered = df[df.Date > pd.Timestamp.now() - pd.to_timedelta("30day")]
A.B
  • 20,110
  • 3
  • 37
  • 71
  • 1
    Or just use `pd.Timestamp.now()`. – Henry Yik Aug 28 '20 at 16:41
  • @HenryYik, great, thanks for reminding, added to the answer – A.B Aug 28 '20 at 16:45
  • hi still have same problem after below - python still confusing date with month – David A Aug 28 '20 at 17:05
  • can you eloborate more about what do you mean by confusing betwwen date with month, may be put desired and current output – A.B Aug 28 '20 at 17:07
  • say timestamp is : 2020-07-29, python will return dates including 2020-09-03 - however this is supposed to be ninth of march, not third of september – David A Aug 28 '20 at 17:15
  • I have updated my answer, can you try to reformat first? @DavidA – A.B Aug 28 '20 at 17:31
  • hi have done this - df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y') but it still makes no difference. Doesnt make any sense. I found your linkedin on your profile, if you like I can add you, send you full code, then give you endorsement when you figure it out? thanks – David A Aug 28 '20 at 17:45
  • I just executed you code and it works fine, you just dont have any date greater than 29/07 which is one month delta – A.B Aug 28 '20 at 18:09
  • (i dont get any row in filtered_df as all are before one month delta) – A.B Aug 28 '20 at 20:05