
I need to find all the dates in a text document. The dates are written either as "24th of April" or as "December 18th". I wrote code that does the job, but the output is messy.

I tried combining the two regexes with the "|" operator, but then I get lots of blank entries in the output.

import re

# Raw strings keep the \s escapes intact for the regex engine.
d1 = r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+([0-9]{1,2})(st|nd|rd|th)"

d2 = r"([0-9]{1,2})(st|nd|rd|th)\s+(of)\s+(January|February|March|April|May|June|July|August|September|October|November|December)"

e1 = re.compile(d1)
e2 = re.compile(d2)

# text holds the contents of the document
dat1 = re.findall(e1, text)
dat2 = re.findall(e2, text)

print("\nList of dates in collection are : " + str(dat1) + str(dat2))

Actual Result:

[('January', '6', 'th'), ('January', '2', 'nd')][('4', 'th', 'of', 'March')]

Expected Result:

[('January 6th'), ('January 2nd'), ('4th of March')]
Georgy

3 Answers

1

Maybe try this:

>>> import re

>>> string = '''On 24th of April, 1492 Columbus sailed the Ocean Blue
Setting the stage for imperial conquest where the first native was slain on December 18th
This system would continue until April 1st, 2019 when Arijit Jha thought of posting on S.O.
And finally delivered his message on the 11th of April'''



>>> re.findall(r'(?i)([\d]{1,2}[a-z]{2}[\s\w]{4,5}(?:(?:Jan|Febr)(?:uary)?|March|April|May|June|July|August|(?:Septem|Octo|Novem|Decem)(?:ber)?)|(?:(?:Jan|Febr)(?:uary)?|March|April|May|June|July|August|(?:Septem|Octo|Novem|Decem)(?:ber)?)[\s]{1,2}[\d]{1,2}[a-z]{2})', string)



#OUTPUT
['24th of April', 'December 18th', 'April 1st', '11th of April']


You can also try the pattern below, but it will also match a month name on its own, with no day next to it, which you might not want:

>>> re.findall(r'(?i)((?:[\d]{1,2}[a-z]{2}[\ \w]{4,5})*(?:(?:Jan|Febr)(?:uary)?|March|April|May|June|July|August|(?:Septem|Octo|Novem|Decem)(?:ber)?)(?:[\ ]{1,2}[\d]{1,2}[a-z]{2}(?=\s|$|\W))*)', string)
FailSafe
  • It really is an impressive regex. May I suggest breaking it down into smaller strings and concatenating them for better readability? – MattMS Apr 12 '19 at 11:26
  • I will. I'll just wait a bit. I kind of hate it when posters ask questions they're not serious about. Yea, it might help passersby, but I'll wait until the OP returns to make edits. Thanks for the heads up though. – FailSafe Apr 12 '19 at 11:42
  • That's fair. I'll chuck you an upvote anyway to help out. – MattMS Apr 12 '19 at 12:13
  • Thanks my friend. Much appreciated. – FailSafe Apr 12 '19 at 12:17
1

In case you were unaware of them, have a look at the standard library's datetime.strptime function and the Arrow library first.
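
As a rough sketch of the strptime route (not part of the original answer): once a date string has been found, the ordinal suffix and the word "of" can be stripped so that datetime.strptime accepts it. The helper name parse_day_month and its default year are assumptions made for illustration, since the strings in the question carry no year.

```python
import re
from datetime import datetime


def parse_day_month(text, year=2019):
    """Hypothetical helper: turn 'January 6th' or '6th of January' into a datetime.

    The year is an assumed default because the input has none.
    """
    # Drop the ordinal suffix that directly follows a digit: "6th" -> "6".
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text)
    # Drop the connecting word: "6 of January" -> "6 January".
    cleaned = cleaned.replace(" of ", " ")
    # Try "Month day" first, then "day Month". %B depends on the current locale.
    for fmt in ("%B %d %Y", "%d %B %Y"):
        try:
            return datetime.strptime("{} {}".format(cleaned, year), fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: {!r}".format(text))


first = parse_day_month("January 6th")
second = parse_day_month("4th of March")
print(first.date(), second.date())
```

This only converts strings that have already been extracted; finding them in running text still needs a regex such as the ones in this thread.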

I'm quite impressed by the regex in FailSafe's answer, but here is my approach:

import re

# Building blocks shared by both patterns.
p = dict(
  day='[0-9]{1,2}',
  month='January|February|March|April|May|June|July|August|September|October|November|December',
  suffix='nd|rd|st|th'
)
# Format any match as "Month DDth", whichever pattern produced it.
a = lambda m: '{month} {day}{suffix}'.format(**m.groupdict())

# Raw strings keep the \s escapes intact for the regex engine.
d1 = r'(?P<month>{month})\s+(?P<day>{day})(?P<suffix>{suffix})'.format(**p)
d2 = r'(?P<day>{day})(?P<suffix>{suffix})\s+of\s+(?P<month>{month})'.format(**p)

a(re.search(d1, 'January 6th')) # 'January 6th'
a(re.search(d2, '6th of January')) # 'January 6th'

This uses the named groups feature of Python regexes: m.groupdict() returns the captured parts as a dict, which can be passed straight to str.format.

To take it further (simplifying "d[12]" regexes):

p2 = {k: '(?P<{}>{})'.format(k, v) for k, v in p.items()}
d1 = r'{month}\s+{day}{suffix}'.format(**p2)
d2 = r'{day}{suffix}\s+of\s+{month}'.format(**p2)
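
To get the flat list the question asks for, the two patterns can be run over the whole text and the matches merged. This is only a sketch under stated assumptions: find_dates and the sample text are made up for illustration. The patterns are run separately because Python's re rejects a single pattern that defines the same group name twice, so d1 and d2 cannot simply be joined with "|".

```python
import re

p = dict(
    day=r"[0-9]{1,2}",
    month="January|February|March|April|May|June|July|August|September|October|November|December",
    suffix="nd|rd|st|th",
)
# Wrap each building block in a named group, as in the answer above.
p2 = {k: "(?P<{}>{})".format(k, v) for k, v in p.items()}
d1 = r"{month}\s+{day}{suffix}".format(**p2)
d2 = r"{day}{suffix}\s+of\s+{month}".format(**p2)


def find_dates(text):
    """Hypothetical helper: return every date as it appears, in document order."""
    matches = [m for pattern in (d1, d2) for m in re.finditer(pattern, text)]
    matches.sort(key=lambda m: m.start())
    # group(0) is the whole matched text, e.g. '4th of March'.
    return [m.group(0) for m in matches]


text = "Paid on January 6th, shipped on the 4th of March, returned on January 2nd."
found = find_dates(text)
print(found)  # ['January 6th', '4th of March', 'January 2nd']
```

Replacing m.group(0) with the lambda a from above would instead normalise every date to the "Month DDth" form.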
MattMS
  • Tbh, I'm always afraid of lambda. I've got to get myself comfortable with it and list comprehension again. Great solution, man. – FailSafe Apr 12 '19 at 00:05
  • Thanks, you too! I must admit I had to search for the syntax as it's been a while since using both. I was writing more in the style of working with the REPL rather than how I'd code in a proper source file. – MattMS Apr 12 '19 at 11:20
  • I haven't heard of the `arrow` library, so thanks for posting that info. I have used the libraries `dateparser` and `parsedatetime`. All these libraries have gaps that won't handle certain date formats. – Life is complex Apr 12 '19 at 12:25
0

You are using capturing groups, (opt1|opt2|opt3), so re.findall returns each group separately instead of the whole match.

Use non-capturing groups instead, (?:opt1|opt2|opt3),
for example:
(?:January|February|March|April|May|June|July|August|September|October|November|December)

See also: What is a non-capturing group? What does (?:) do?
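
A minimal sketch of this idea applied to the question's two patterns; the names MONTHS and PATTERN and the sample text are assumptions for illustration. With no capturing groups left, re.findall returns each whole match as a plain string, so the two formats can be combined with "|" without producing empty entries.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Every group is non-capturing, so findall returns whole matches.
PATTERN = re.compile(
    rf"(?:{MONTHS})\s+[0-9]{{1,2}}(?:st|nd|rd|th)"
    rf"|[0-9]{{1,2}}(?:st|nd|rd|th)\s+of\s+(?:{MONTHS})"
)

text = "We met on January 6th, again on January 2nd, and last on the 4th of March."
dates = PATTERN.findall(text)
print(dates)  # ['January 6th', 'January 2nd', '4th of March']

# For contrast: with capturing groups, the groups of the branch that did not
# match show up as empty strings in each tuple.
capturing = re.findall(r"(Jan)\s(\d)|(\d)\sof\s(Jan)", "Jan 6, 4 of Jan")
print(capturing)  # [('Jan', '6', '', ''), ('', '', '4', 'Jan')]
```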

TheDelta