I am working on a single-column pandas data frame consists of thousands (rows) of string expression. Each string may contain "date" information of different formats, for instance:
05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015
To use a couple of strings as examples:
df[0] he plans to depart on 6/12/95
df[1] as of Mar. 23rd, 2011, the board decides that...
df[2] the 12-28-01 record shows...
I would like to use a findall() function after df, such that df.str.findall(r'') extracts date elements:
[0] 6/12/95
[1] Mar. 23rd, 2011
[2] 12-28-01
from the original strings, followed by some 'sort' command line to sort the extracted dates in chronological order by their indices, so that the output should look like
[0] 1
[1] 3
[2] 2
I (tentatively) use the following function
df.str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
but have no clue as to how to deal with
(1) ordinal indicator after digits: st, th, nd
(2) the occasional "." values representing abbreviation, and
(3) slash (/) and hyphen (-)
using regex final function in one go.
Also, after all the extraction works are done, I want to sort them in chronological order with their respective indices (i.e., 1, 2, 3,..., n). But my currently knowledge of regex is insufficient to know how Python is able to sort these different date format in chronological order.
It will be really appreciated if someone could enlighten me with some handy tricks on the .findall() function for this or explain the mechanisms for sorting date expressions.