1

I have a dataframe df['Latest date'] as below

45      01.12.2015 - 13:35:23
46      01.12.2015 - 13:36:27
                ...          
4355    09.07.2018 - 07:54:22
4356    09.07.2018 - 07:52:03
4357    24.07.2018 - 16:00:36
4358    09.07.2018 - 07:53:19
4359    09.07.2018 - 11:40:22
4369    23.07.2018 - 09:13:34
4370    16.07.2018 - 13:02:28
4371    23.07.2018 - 09:14:13
4372    23.07.2018 - 09:14:44
4374    23.07.2018 - 09:15:35
4378    23.07.2018 - 09:15:58
4379    23.07.2018 - 09:16:24
4380    23.07.2018 - 09:16:41
4381    23.07.2018 - 09:17:16
4383    23.07.2018 - 09:17:53
4387    23.07.2018 - 09:18:28
4389    23.07.2018 - 09:19:25
4393    23.07.2018 - 09:20:08
4394    25.07.2018 - 14:25:20
4395    18.07.2018 - 10:10:58
4396    18.07.2018 - 10:10:18
4398    18.07.2018 - 10:10:42
4399    16.07.2018 - 11:09:21
4400    16.07.2018 - 11:27:23
4401    23.07.2018 - 09:21:16
4407    18.07.2018 - 10:11:15
4417    24.07.2018 - 11:56:14
4418    25.07.2018 - 11:05:11

I want to filter for the rows with date> 22/07/2018

I tried df=df[(df["Latest date"]>"22/07/2018")]

however the new df returns entries not necessarily after 22/07/2018. I wonder if it is not understanding the dates are in "dd/mm/yy" formats.

Candice
  • 199
  • 3
  • 13
  • 1
    You need to ue a `date` object, not a string. Right now you perform ordinal comparisons, so that means that `23/06/1984` is seen as greater. – Willem Van Onsem Jul 29 '18 at 11:40

1 Answers1

2

You need first parse dates to datetimes by to_datetime with parameter format, see also http://strftime.org/ if different formating:

df["Latest date"] = pd.to_datetime(df["Latest date"], format='%d.%m.%Y - %H:%M:%S')
#slowier solution without defined format
#df["Latest date"] = pd.to_datetime(df["Latest date"])

print (df["Latest date"].dtype)
datetime64[ns]

[30000 rows x 1 columns]
df = pd.concat([df] * 1000, ignore_index=True)

In [204]: %timeit df["Latest date1"] = pd.to_datetime(df["Latest date"], format='%d.%m.%Y - %H:%M:%S')
163 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [205]: %timeit df["Latest date2"] = pd.to_datetime(df["Latest date"])
5.09 s ± 55.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So now is possible filtering:

df=df[(df["Latest date"]>"22/07/2018")]
print (df)
             Latest date
4357 2018-07-24 16:00:36
4369 2018-07-23 09:13:34
4371 2018-07-23 09:14:13
4372 2018-07-23 09:14:44
4374 2018-07-23 09:15:35
4378 2018-07-23 09:15:58
4379 2018-07-23 09:16:24
4380 2018-07-23 09:16:41
4381 2018-07-23 09:17:16
4383 2018-07-23 09:17:53
4387 2018-07-23 09:18:28
4389 2018-07-23 09:19:25
4393 2018-07-23 09:20:08
4394 2018-07-25 14:25:20
4401 2018-07-23 09:21:16
4417 2018-07-24 11:56:14
4418 2018-07-25 11:05:11

EDIT:

Another possible solution if input data are in csv is use parse_dates parameter by position of column or by column name:

df = pd.read_csv(file, parse_dates=['Latest date']) 
#e.g. third column, pandas counts from 0
#df = pd.read_csv(file, parse_dates=[2]) 
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252