3

I am working on a single-column pandas data frame consists of thousands (rows) of string expression. Each string may contain "date" information of different formats, for instance:

05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015

To use a couple of strings as examples:

df[0] he plans to depart on 6/12/95
df[1] as of Mar. 23rd, 2011, the board decides that...
df[2] the 12-28-01 record shows...

I would like to use a findall() function after df, such that df.str.findall(r'') extracts date elements:

[0] 6/12/95
[1] Mar. 23rd, 2011
[2] 12-28-01

from the original strings, followed by some 'sort' command line to sort the extracted dates in chronological order by their indices, so that the output should look like

[0] 1
[1] 3
[2] 2

I (tentatively) use the following function

df.str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')

but have no clue as to how to deal with

(1) ordinal indicator after digits: st, th, nd

(2) the occasional "." values representing abbreviation, and

(3) slash (/) and hyphen (-)

using regex final function in one go.

Also, after all the extraction works are done, I want to sort them in chronological order with their respective indices (i.e., 1, 2, 3,..., n). But my currently knowledge of regex is insufficient to know how Python is able to sort these different date format in chronological order.

It will be really appreciated if someone could enlighten me with some handy tricks on the .findall() function for this or explain the mechanisms for sorting date expressions.

Chris T.
  • 1,699
  • 7
  • 23
  • 45
  • Please give an example of what the expression should match and what it should capture –  Aug 04 '17 at 11:02
  • I'll add that to the edited text. Thanks for addressing this. – Chris T. Aug 04 '17 at 11:04
  • 1
    I think this is really hard, it's literally not possible to extract various date formats in one single .findall() command. I am wondering if it's possible to do this for each date format one by one, stack them together and, finally, convert them to the same format 'yyyy-mm-dd' using .to_datetime()? – Chris T. Aug 04 '17 at 16:12

3 Answers3

2

dateutil.parser.parse could help you to avoid regex - it's surelly a good thing to do here.

It basically takes a string and tries to parse it in datetime object and that's great because datetime can be sorted easily.

from dateutil.parser import parse

data = """05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015"""

# Parse data into list of strings
data = data.replace('\n', ';').split(';')

dates = []
for line in data:
    try:
        dates.append(parse(line))
    except TypeError:
        # it's not parsable
        pass

print list(sorted(dates))

Cutted output:

[datetime.datetime(2001, 2, 4, 0, 0), datetime.datetime(2001, 3, 12, 0, 0), datetime.datetime(2001, 3, 25, 0, 0), datetime.datetime(2001, 3, 25, 0, 0) ...]

As you can see you win on two points:

  1. It's really easy to sort datetime objects
  2. You don't have to trust any long & complex regex pattern to know if a string is a date, parse do it for you
Arount
  • 9,853
  • 1
  • 30
  • 43
  • Hi, thanks for your reply. I experimented your code on my editor, it returned an error message saying that 'line' is not parsable. Also, the desired output format should look those I listed in the middle of the question text. – Chris T. Aug 04 '17 at 15:45
  • could you gives me the error please (complete traceback)? – Arount Aug 04 '17 at 16:32
  • --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () 15 for line in data: 16 try: ---> 17 dates.append(parse(line)) 18 except TypeError: 19 # it's not parsable – Chris T. Aug 04 '17 at 16:35
  • I agree that dateparser.parse() might be a straightforward way to convert date time into human readable format, but the hardest part for me is to make this function to work on a thousands of rows of "strings" in a single-column pandas dataframe. – Chris T. Aug 04 '17 at 17:29
  • the complexity of the problem is proportional to the number of different formats you want to support. A good option might be to have a set of regexes that can pull the date string from the text, and then use this library to convert them to datetime objects so you can handle them more easily. – TallChuck Aug 04 '17 at 17:35
0

I would try using the two following two modules. dateutil in this answer:

Extracting date from a string in Python

and/or dateparser:

https://dateparser.readthedocs.io/en/latest/

Julius
  • 1,451
  • 1
  • 15
  • 19
  • I have read those two couple hours ago, but still not able to apply them to a pandas data frame having tens of different date time format. Hence, the question in this thread. I guess the 'sort' thing should be easy once I am able to extract date time element in each row of text from the data frame. – Chris T. Aug 04 '17 at 17:38
0

try this """(r'(?:\d{1,2}[ ][/-])?(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* )?(?:\d{1,2}[/-])?\d{2,4}')"""

Pawanvir singh
  • 373
  • 6
  • 17