
I have an unstructured set of text files that contain multiple occurrences of dates and date durations in various formats, such as:

  1. 19 Jan 2015 - 20 May 2015
  2. 19 Jan 2015
  3. Jan 2015
  4. Jan - May 2015
  5. Jan - May '15
  6. Jan 2015 to May 2015

They also contain dates in the standard forms:

Jan 19, 1990
January 19, 1990
Jan 19,1990
01/19/1990
01/19/90
1990
Jan 1990
January1990

I tried

re.findall("((?:(?:[0-2]?\\d{1})|(?:[3][01]{1})))(?![\\d])(.)((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))(.)((?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])",txt)

to find all occurrences of dates and durations, but I am not getting the desired results. What regular expression should I use to match all of the cases above? Ideally, I need to find and extract every date and duration in a text file.

Sample data from different text files:

The date was 22/06/1995, Mr. Jeff had been working on his book since May 1993 …. The had collaborated during the dispute within the company from 22 Jan 1994 to 28 Jun 1994….They had a history of chronic illness in their family….found in Jan, 1980….he lived short, since 22/01/1996 - 22/08/1999….They were scared to open the ancient Tomb which according to the manuscript had been sealed in June,1560…
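For illustration only, here is a minimal sketch of one way the listed formats could be covered: build the pattern from smaller raw-string pieces, then allow an optional range start before a full date. The names `MONTH`, `YEAR`, `SINGLE`, `DATE_OR_RANGE` and `find_dates` are hypothetical, the sample string is an abridged version of the data above, and bare four-digit years such as `1990` are deliberately left out on the assumption that they are too ambiguous to match safely.

```python
import re

# Month names, abbreviated or full.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
# Four-digit year, or an apostrophe plus two digits ('15).
YEAR = r"(?:\d{4}|'\d{2})"

# One complete date; alternatives are tried in order.
SINGLE = "|".join([
    r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})",      # 01/19/1990, 22/06/1995, 01/19/90
    r"\d{1,2}\s+" + MONTH + r"\s+" + YEAR,   # 19 Jan 2015
    MONTH + r"\s+\d{1,2},\s*" + YEAR,        # Jan 19, 1990 / Jan 19,1990
    MONTH + r",?\s*" + YEAR,                 # Jan 2015, Jan, 1980, January1990
])

# Optional range start (a full date or a bare month), a "-" or "to"
# separator, then a full date. The start is optional, so single dates match too.
DATE_OR_RANGE = re.compile(
    r"\b(?:(?:" + SINGLE + "|" + MONTH + r")\s*(?:-|to)\s*)?"
    r"(?:" + SINGLE + r")(?!\d)"
)

def find_dates(text):
    """Return every date or date range found in text, in order of appearance."""
    return [m.group(0) for m in DATE_OR_RANGE.finditer(text)]

# Abridged sample, adapted from the data above.
sample = ("The date was 22/06/1995, Mr. Jeff had been working on his book "
          "since May 1993. They worked from 22 Jan 1994 to 28 Jun 1994, "
          "found in Jan, 1980, sealed in June,1560.")
for hit in find_dates(sample):
    print(hit)
```

The pattern is case-sensitive and does not check that a day or month number is valid, so each hit would still need to be parsed before it can be trusted as a real date.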

User54211
  • Ideally, you could show some *real* input text. Additionally, if you use a raw string, `re.findall(r'...')` (mind the **r** at the beginning), you do not need to escape the backslashes. – Jan Aug 09 '16 at 12:20
  • You can use `datetime.datetime.strptime` with multiple patterns. See the following links for more information. `strptime`: http://strftime.org/ Multiple formats: http://stackoverflow.com/a/23581184/6119465 – 2Cubed Aug 09 '16 at 22:25
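As a rough sketch of the `strptime` suggestion in the comment above: try a list of candidate formats in order and return the first that parses. The format list and the helper name `parse_date` are assumptions made for illustration, not taken from the linked answer. The sketch also assumes an English locale for month names, and it resolves ambiguous slash dates by trying month-first before day-first.

```python
from datetime import datetime

# Candidate formats, tried in order (an illustrative list; extend as needed).
FORMATS = [
    "%d %b %Y",    # 19 Jan 2015
    "%b %d, %Y",   # Jan 19, 1990
    "%b %d,%Y",    # Jan 19,1990
    "%B %d, %Y",   # January 19, 1990
    "%m/%d/%Y",    # 01/19/1990 (month first)
    "%d/%m/%Y",    # 22/06/1995 (day first, used only if month-first fails)
    "%m/%d/%y",    # 01/19/90
    "%b %Y",       # Jan 1990
    "%B %Y",       # January 1990
    "%B%Y",        # January1990
    "%Y",          # 1990
]

def parse_date(text):
    """Return a datetime for the first format that fits, or None if none does."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
    return None

print(parse_date("Jan 19,1990"))
print(parse_date("22/06/1995"))
```

`strptime` only accepts a string that is entirely a date, so it does not locate dates inside running text; the candidate substrings have to be extracted first. Formats without a day, such as `%b %Y`, default to the first day of the month.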
