I need a Python regular expression that matches every valid date in a raw text file. I split the text into lines and put them in a pandas Series; the goal now is to extract only the date from each line, producing a Series of dates. I was able to match most of the numeric date formats, but I got stuck on dates with literal month names (Jan, January, Feb, February, ...). In particular, I need a regex (or a set of them) that matches the following formats:

- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010

Any help will be appreciated, thank you in advance!

ᴀʀᴍᴀɴ
Davide Tamburrino

1 Answer

In line with the comment I made, I suggest using split and strip to generate a list of candidate date strings from your text, then feeding each one to dateutil.parser.parse() to turn it into a proper datetime object that you can manipulate to your liking.

Possible implementation below:

from dateutil.parser import parse

test = '''- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010'''

# Split each line on ';' and strip the list markers and surrounding spaces.
list_of_dates = []
for line in test.split('\n'):
    for date in line.split(';'):
        list_of_dates.append(date.strip(' -'))

def is_date(string):
    """Return True if dateutil can parse the string as a date."""
    try:
        parse(string)
        return True
    except (ValueError, OverflowError):
        return False

# Keep only the candidates that parse, converted to datetime objects.
found_dates = []
for date in list_of_dates:
    if is_date(date):
        found_dates.append(parse(date))

for date in found_dates:
    print(date)

Result:

2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-21 00:00:00
2009-03-22 00:00:00
2009-02-04 00:00:00
2009-09-04 00:00:00
2010-10-04 00:00:00
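
The last three results show day 04 because dateutil takes any field missing from the input from its default, which is the current date (presumably the day the snippet was run). A minimal sketch, assuming you would rather pin month-only dates to the first of the month, passes an explicit default to parse(); the names FIRST_OF_MONTH and parse_with_fixed_default are only illustrative:

```python
from datetime import datetime

from dateutil.parser import parse

# Illustrative name: any field missing from the input is taken from this value.
FIRST_OF_MONTH = datetime(2000, 1, 1)

def parse_with_fixed_default(text):
    """Parse a date string, using day 1 when the input has no day."""
    return parse(text, default=FIRST_OF_MONTH)

print(parse_with_fixed_default("Feb 2009"))
print(parse_with_fixed_default("Mar 20th, 2009"))
```

Dates that already include a day are unaffected; only the missing fields come from the default.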
BoboDarph
  • Thank you, but this is not my scenario. I have a Series where every item is a line of text containing a date somewhere, in any format. So I can't split or strip the text. – Davide Tamburrino Aug 04 '17 at 08:56
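
For the scenario described in the comment, where the date sits somewhere inside a line of free text, a regex along the following lines might be a starting point. This is only a sketch covering the literal-month formats listed in the question. It assumes English month names with a capital first letter and four-digit years, and it does not check that the day is valid for the month. The names DATE_RE and extract_date are made up for illustration.

```python
import re

# Building blocks (illustrative names, not from any library).
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?"
         r"|Dec(?:ember)?)\.?")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"
YEAR = r"\d{4}"
SEP = r"[\s,.-]+"  # spaces, commas, dots, or hyphens between the parts

DATE_RE = re.compile(
    rf"""
    \b(?:
        {DAY}{SEP}{MONTH}{SEP}{YEAR}   # 20 Mar 2009; 20 March, 2009
      | {MONTH}{SEP}{DAY}{SEP}{YEAR}   # Mar-20-2009; Mar. 20, 2009; Mar 20th, 2009
      | {MONTH}{SEP}{YEAR}             # Feb 2009
    )\b
    """,
    re.VERBOSE,
)

def extract_date(line):
    """Return the first date-like substring in the line, or None."""
    match = DATE_RE.search(line)
    return match.group(0) if match else None

for line in ["Seen on Mar-20-2009 at the clinic", "since Feb 2009", "no date here"]:
    print(extract_date(line))
```

With a pandas Series, something like series.map(extract_date) should apply this to each line, and the extracted strings can then be handed to dateutil.parser.parse() as in the answer above.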