I am trying to write a regular expression that catches several different date formats.

The sentences are in a pandas Series. Each entry in the Series contains exactly one date, but may also contain other numbers.

The dates appear in formats like these:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

For years that have only two digits, we assume a 20th-century year (i.e. 19nn).

Here is my regular expression:

df_dates = df.str.extract(r'((?:\d{1,2})?[-/\s,]{0,2}(?:\d{1,2})?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[-/\s,]{0,2}(?:19|20)?\d{2})')

My regex produces these results:

input1

Lab: B12 969 2007\n

found1

12,969

input2

Contemplating jumping off building - 1973 - difficulty writing paper.\n

found2

1973

Question

How do I change my regex to obtain the desired results?

jwpfox
Yanpei
  • I don't see why "1973" shouldn't be matched. Your last example is "2010", which has the same format as "1973". – Racso Dec 19 '17 at 00:34
  • Not sure if duplicate. This is a question about regular expressions, while the other isn't restricted to regex. That being said, I do think that the answers in the question you link are probably useful for this case, too. – Racso Dec 19 '17 at 02:10

1 Answer

I strongly believe that you should use several regular expressions to process your data instead of trying to do everything with a single one. That gives you a much more flexible system: adding a new date format means adding one more small pattern, rather than editing a hard-to-read regex and making it even more obscure.

Since you're using regex from a programming language, you can build the patterns in code so that you don't duplicate strings. As an example, consider this quick, incomplete and dirty snippet:

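import re

# Build the patterns from small, reusable pieces.
monthsShort = "Jan|Feb"
monthsLong = "January|February"
months = "(" + monthsShort + "|" + monthsLong + ")"
separators = "[/-]"
days = r"\d{2}"   # raw strings keep the backslash intact
years = r"\d{4}"  # defined for later formats; unused below

regex1 = months + separators + days  # e.g. Jan/01
regex2 = days + separators + months  # e.g. 01/Jan

print(re.search(regex1, "Jan/01"))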

In the end, I have a couple of regexes that match two date formats. Completing them is straightforward, adding more formats is easy, and the whole thing is easier to read. Of course, you have to be careful when concatenating pieces of regex (it is easy to forget things like parentheses), but I think that's much easier than dealing with one obscure regular expression.
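As a rough illustration of what "completing" the pieces might look like, here is a minimal sketch covering some of the formats from the question. All names (MONTH, DAY, YEAR, SEP and the three compiled patterns) are hypothetical and chosen for this example only; the separator and suffix rules are assumptions that you would likely need to tune against your real data.

```python
import re

# Hypothetical building blocks; none of these names come from the original snippet.
MONTHS_SHORT = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
MONTHS_LONG = ("January|February|March|April|May|June|July|August|"
               "September|October|November|December")
# Month name, long or short, with an optional trailing dot ("Mar.").
MONTH = rf"(?:{MONTHS_LONG}|{MONTHS_SHORT})\.?"
# Day number with an optional ordinal suffix ("20th", "21st").
DAY = r"\d{1,2}(?:st|nd|rd|th)?"
# Four-digit year, or a two-digit year as a fallback.
YEAR = r"(?:\d{4}|\d{2})"
# One or two separator characters: dash, slash, whitespace or comma.
SEP = r"[-/\s,]{1,2}"

month_day_year = re.compile(rf"\b{MONTH}{SEP}{DAY}{SEP}{YEAR}\b")  # Mar 20th, 2009
day_month_year = re.compile(rf"\b{DAY}{SEP}{MONTH}{SEP}{YEAR}\b")  # 20 March, 2009
numeric = re.compile(r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b")       # 4/3/09

print(month_day_year.search("Seen on Mar. 20, 2009 at clinic").group())
print(day_month_year.search("Admitted 20 March, 2009.").group())
print(numeric.search("Visit on 4/3/09 went well").group())
```

Each pattern stays short enough to read at a glance, and a new format is one more line rather than another branch inside a long expression.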

EDIT: I forgot to mention something: after generating your regular expressions, you can put them in a list, for example, and iterate over it to apply each one to your text within a single loop. Or, if you really want to, you can join all of them into a single regex (using parentheses and vertical bars) and apply it with a single statement.
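Both options might look roughly like the sketch below. The three patterns, the PATTERNS list, the find_date helper and the COMBINED name are all made up for illustration, and the ordering (most specific format first) is an assumption about what you want when several formats could match.

```python
import re

# Illustrative patterns only, ordered from most to least specific.
PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"),  # 04/20/2009, 4/3/09
    re.compile(r"\b\d{1,2}/\d{4}\b"),                     # 6/2008
    re.compile(r"\b(?:19|20)\d{2}\b"),                    # 2009
]


def find_date(text):
    """Return the match of the first pattern in PATTERNS that hits, else None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group()
    return None


# Alternative: join the same patterns into one regex with vertical bars.
COMBINED = re.compile("|".join(f"(?:{p.pattern})" for p in PATTERNS))

print(find_date("Lab: B12 969 2007\n"))
print(find_date("Contemplating jumping off building - 1973 - difficulty writing paper.\n"))
print(COMBINED.search("Follow-up due 6/2008").group())
```

One difference to keep in mind: the loop prefers whichever pattern comes first in the list, while the combined regex returns the leftmost match in the text and only uses the order of alternatives to break ties at the same position.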

Racso