I have a few documents I'm extracting dates from, using this regex:
query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril
|[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary
|[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept
|[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
OR
query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|
[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|
[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|
[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
The only difference between the two is that one has the |'s at the beginning of each new line and the other has them at the end of each line. The two match different things: with | at the end of the line I can't match May, but with | at the beginning of the line I can't match January (assuming the day, year, and spaces are otherwise correct). I literally just move the | position around, and what I was matching before no longer matches, and vice versa. Am I doing something wrong, is there a way around this, or is there a correct way to do this instead? Obviously the goal is to match both. If you want to try it yourself, the cases I can easily reproduce are '8 may 2018' and '25 january 2018'.
The rest of my code is just re.search(query, doc), which is what's failing to match.
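Here is a minimal sketch that reproduces the behaviour with a cut-down month list. The names `LEADING_BAR`, `TRAILING_BAR`, `DOCS`, and `matches` are made up for illustration, and it assumes the standard-library `re` module rather than the third-party `regex` package. The line break inside a triple-quoted string is a literal newline character in the pattern, so it becomes part of whichever alternative it touches. Passing `re.VERBOSE`, which makes the engine ignore unescaped whitespace in the pattern, is included as one possible workaround for comparison.

```python
import re

# Reduced versions of the two patterns, keeping the same line-break layout.
# Raw strings are used here so that \s reaches the regex engine unchanged.
LEADING_BAR = r"""([0-9]{1,2})?\s{1,2}([mM]arch|[aA]pril
|[mM]ay|[oO]ctober|[jJ]anuary
|[nN]ovember|[jJ]an)\s{1,2}([0-9]{2,4})"""

TRAILING_BAR = r"""([0-9]{1,2})?\s{1,2}([mM]arch|[aA]pril|
[mM]ay|[oO]ctober|[jJ]anuary|
[nN]ovember|[jJ]an)\s{1,2}([0-9]{2,4})"""

DOCS = ["8 may 2018", "25 january 2018"]


def matches(pattern, flags=0):
    """Return {doc: bool} saying whether re.search finds the pattern in each doc."""
    return {doc: re.search(pattern, doc, flags) is not None for doc in DOCS}


for name, pattern in [("leading |", LEADING_BAR), ("trailing |", TRAILING_BAR)]:
    # As written, the newline belongs to the alternative next to it.
    print(name, "as written:", matches(pattern))
    # With re.VERBOSE, the newline in the pattern is ignored.
    print(name, "re.VERBOSE:", matches(pattern, re.VERBOSE))
```

As written, each variant misses one of the two inputs. With `re.VERBOSE`, both variants match both.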
Note: Python 3.6.8, regex==2018.1.10