I am trying to write a regular expression to capture dates in several different formats.
The text is in a pandas Series, and each entry of the Series contains only one date but may also contain other numbers.
The dates can appear in any of these formats:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
For years written with only two digits, we assume a 20th-century year (i.e. 19nn).
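To make the two-digit rule concrete, here is a minimal sketch of how I read it. The helper name expand_year is hypothetical and is not part of my actual code; it assumes the year has already been pulled out as a string of two or four digits.

```python
def expand_year(year_text: str) -> int:
    """Return a four-digit year, reading a two-digit year as 19nn."""
    year = int(year_text)
    # Only two-digit years are shifted; four-digit years pass through unchanged.
    if len(year_text) == 2:
        return 1900 + year
    return year


print(expand_year("09"))    # 1909
print(expand_year("2009"))  # 2009
```

So 04/20/09 should be read as 20 April 1909, while 04/20/2009 stays in 2009.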
Here is my regular expression:
df_dates = df.str.extract(r'((?:\d{1,2})?[-/\s,]{0,2}(?:\d{1,2})?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[-/\s,]{0,2}(?:19|20)?\d{2})')
My regex produces these results:
input1
Lab: B12 969 2007\n
found1
12,969
input2
Contemplating jumping off building - 1973 - difficulty writing paper.\n
found2
1973
Question
How do I change my regex so that it extracts only the date and ignores the other numbers in each entry (for example, 2007 rather than 12,969 for input1)?