15

Is there any nice way to validate that all items in a dataframe's column have a valid date format?

My date format is 11-Aug-2010.

I saw this generic answer, where:

try:
    datetime.datetime.strptime(date_text, '%Y-%m-%d')
except ValueError:
    raise ValueError("Incorrect data format, should be YYYY-MM-DD")

source: https://stackoverflow.com/a/16870699/1374488

But I assume that's not good (efficient) in my case.

I assume I have to modify the strings to be pandas dates first as mentioned here: Convert string date time to pandas datetime

I am new to the Python world, any ideas appreciated.

cs95
  • 379,657
  • 97
  • 704
  • 746
lukas_o
  • 3,776
  • 4
  • 34
  • 50

2 Answers2

22

(format borrowed from piRSquared's answer)

if pd.to_datetime(df['date'], format='%d-%b-%Y', errors='coerce').notnull().all():
    # do something 

This is the LYBL—"Look Before You Leap" approach. This will return True assuming all your date strings are valid - meaning they are all converted into actual pd.Timestamp objects. Invalid date strings are coerced to NaT, which is the datetime equivalent of NaN.

Alternatively,

try:
    pd.to_datetime(df['date'], format='%d-%b-%Y', errors='raise')
    # do something
except ValueError:
    pass

This is the EAFP—"Easier to Ask Forgiveness than Permission" approach, a ValueError is raised when invalid date strings are encountered.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • What if the `'date'` column contains, for example, the value `1`? This would not throw an error in `to_datetime` but probably isn't valid. The best way to get around this would be to specify `format` – asongtoruin Mar 22 '18 at 18:02
  • 1
    Thanks, I was looking exactly for the LYBL approach! Great! – lukas_o Mar 22 '18 at 18:27
  • What if any format is OK but all needed is to have the same format? – Royi Jul 07 '22 at 12:11
8

If you know your format, you can use boolean slicing

mask = pd.to_datetime(df.columns, format='%d-%b-%Y', errors='coerce').notna()
df.loc[:, mask]

Consider the dataframe df

df = pd.DataFrame(1, range(1), ['11-Aug-2010', 'August2010, I think', 1])
df

   11-Aug-2010  August2010, I think  1
0            1                    1  1

I can filter with

mask = pd.to_datetime(df.columns, format='%d-%b-%Y', errors='coerce').notna()
df.loc[:, mask]

   11-Aug-2010
0            1
piRSquared
  • 285,575
  • 57
  • 475
  • 624