I have an unstructured set of text files that contain multiple occurrences of dates and date ranges (durations) in various formats, such as:
- 19 Jan 2015 - 20 May 2015
- 19 Jan 2015
- Jan 2015
- Jan - May 2015
- Jan - May '15
- Jan 2015 to May 2015
They also contain single dates in more standard forms:
- Jan 19, 1990
- January 19, 1990
- Jan 19,1990
- 01/19/1990
- 01/19/90
- 1990
- Jan 1990
- January1990
I tried the following:
re.findall("((?:(?:[0-2]?\\d{1})|(?:[3][01]{1})))(?![\\d])(.)((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))(.)((?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])",txt)
to find all occurrences of dates and durations, but I am not getting the desired results. Which regular expression should I use to match all of the cases listed above? Ideally, I need to find and extract every date and duration in a text file.
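To narrow down where the pattern above falls short, here is a small check of it against a few of the listed formats. The pattern is copied unchanged (only rewritten as a raw string, which is equivalent to the doubled backslashes); the name `ORIGINAL` and the `samples` list are just for this illustration. As far as I can tell, the pattern requires a day, one separator character, a month name, one separator character and a four-digit year, in that order, so only the `19 Jan 2015` shape gets through.

```python
import re

# The pattern from the question, written as a raw string (same regex).
ORIGINAL = (
    r"((?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])(.)"
    r"((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
    r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?"
    r"|Nov(?:ember)?|Dec(?:ember)?))(.)"
    r"((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3})))(?![\d])"
)

samples = ["19 Jan 2015", "Jan 2015", "Jan 19, 1990", "01/19/1990", "January1990"]

for s in samples:
    # findall returns one tuple of the five capture groups per match
    print(repr(s), "->", re.findall(ORIGINAL, s))

# Only the first sample matches:
# '19 Jan 2015' -> [('19', ' ', 'Jan', ' ', '2015')]
# every other sample -> []
```

Because every piece is a capture group, `re.findall` also returns tuples of fragments rather than the full date string, which makes ranges hard to reassemble afterwards.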
Sample data from different text files:
The date was 22/06/1995, Mr. Jeff had been working on his book since May 1993 …. The had collaborated during the dispute within the company from 22 Jan 1994 to 28 Jun 1994….They had a history of chronic illness in their family….found in Jan, 1980….he lived short, since 22/01/1996 - 22/08/1999….They were scared to open the ancient Tomb which according to the manuscript had been sealed in June,1560…
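One possible direction, offered as a rough sketch rather than a definitive answer: build the pattern from small named pieces, list the single-date shapes from most to least specific, and try a "range" alternative (date or bare month, then `-` or `to`, then a date) before falling back to a single date. All names here (`MONTH`, `DAY`, `YEAR`, `DATE`, `RANGE`, `extract_dates`) are made up for the illustration, and the sketch rests on assumptions: month names are capitalised English, four-digit years start with 1 or 2, ranges use a plain hyphen or the word `to`, and numeric dates are not validated (so `99/99/1990` would also match). A bare four-digit number such as `1990` is treated as a year, which may over-match in other text.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"(?:3[01]|[12]\d|0?[1-9])"
YEAR = r"(?:[12]\d{3}|'\d{2})"   # 2015 or '15

# Single-date shapes, most specific first so longer forms win.
DATE = "(?:" + "|".join([
    r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})",        # 01/19/1990, 01/19/90
    DAY + r"\s+" + MONTH + r",?\s*" + YEAR,    # 19 Jan 2015
    MONTH + r"\s+" + DAY + r",\s*[12]\d{3}",   # Jan 19, 1990 / Jan 19,1990
    MONTH + r",?\s*" + YEAR,                   # Jan 2015, Jan, 1980, January1990
    r"[12]\d{3}",                              # 1990
]) + ")"

SEP = r"(?:\s*-\s*|\s+to\s+)"
# A range starts with a full date or a bare month ("Jan - May 2015").
RANGE = r"\b(?:" + DATE + "|" + MONTH + ")" + SEP + DATE + r"\b"
PATTERN = re.compile(RANGE + r"|\b" + DATE + r"\b")


def extract_dates(text):
    """Return every date or date range found in text, as full strings."""
    # Only non-capturing groups are used, so group(0) is the whole match.
    return [m.group(0) for m in PATTERN.finditer(text)]


# Excerpt of the sample data, with plain dots in place of the ellipses.
sample = ("The date was 22/06/1995, Mr. Jeff had been working on his book "
          "since May 1993 ... from 22 Jan 1994 to 28 Jun 1994...found in "
          "Jan, 1980...since 22/01/1996 - 22/08/1999...sealed in June,1560...")

for found in extract_dates(sample):
    print(found)
```

Keeping everything non-capturing means each hit comes back as one string, so a range such as `22 Jan 1994 to 28 Jun 1994` stays together; splitting it into start and end, or converting to real dates, would be a separate step.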