
Let's say that I have a string that looks like this:

my_date = "February 4 - March 23, 2015"

I want to create a regex that will extract both month names and the year, so I set it up like this:

date_regex = r"^(?:(Jan(?:uary)?|Feb(?:ruary)?|Marc?h?|Apr[il1]?[I1l]?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d?\d(?:\s+-\s+)?){2},\s+(20[01]\d)"

I thought I was being clever by wrapping the part that matches the month and day in a non-capturing group and using {2} to say there should be two of them. Unfortunately, the groups I get from this are ("March", "2015"). The first match, "February", is not captured.

Where am I going wrong? Is it my regex, or is this just not possible?

This question seems related and seems to imply that what I'm trying to do isn't possible without the regex module.

Thanks

tblznbits

2 Answers


Try this RegEx:

(Jan(?:uary)?|Feb(?:ruary)?|Marc?h?|Apr[il1]?[I1l]?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?|20[01]\d)

You over-complicated it. Just match either a month name or the year (20[01]\d), and collect every match instead of trying to get everything from a single match.
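As for why the original pattern returned only ("March", "2015"): in Python's re module, a capturing group that sits inside a repeated group keeps only the text from its last iteration. Here is a minimal sketch of that behaviour. The `[A-Z][a-z]+` month matcher is a simplified stand-in I'm assuming for brevity, not the month alternation from the question, and the names `repeated` and `explicit` are just for illustration.

```python
import re

my_date = "February 4 - March 23, 2015"

# One capturing group for the month, inside a non-capturing group repeated
# twice. The group number is reused, so only the last iteration is kept.
repeated = re.compile(r"^(?:([A-Z][a-z]+)\s+\d?\d(?:\s+-\s+)?){2},\s+(20[01]\d)")
match = repeated.search(my_date)
print(match.groups())  # ('March', '2015')

# Writing the month group out twice gives each month its own group number.
explicit = re.compile(
    r"^([A-Z][a-z]+)\s+\d?\d\s+-\s+([A-Z][a-z]+)\s+\d?\d,\s+(20[01]\d)"
)
print(explicit.search(my_date).groups())  # ('February', 'March', '2015')
```

So if you do want a single anchored match, spelling the group out twice is one option. The alternation approach in this answer avoids the repetition altogether.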

Live Demo on Regex101


How it works:

(
    Jan(?:uary)?|          # January
    Feb(?:ruary)?|         # February
    Marc?h?|               # March
    Apr[il1]?[I1l]?|       # April
    May|                   # May
    June?|                 # June
    July?|                 # July
    Aug(?:ust)?|           # August
    Sep(?:tember)?|        # September
    Oct(?:ober)?|          # October
    Nov(?:ember)?|         # November
    Dec(?:ember)?|         # December
    20[01]\d               # Year
)

It will match either a month name or a year. I am not sure why you used Apr[il1]?[I1l]? for April. Apr(?:il)? is simpler (Apri?l? also works, but it additionally accepts "Apri" and "Aprl").
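In Python, something along these lines should work. This is only a sketch: the `months` and `pattern` names are mine, I've used the simpler Apr(?:il)? and Mar(?:ch)? forms, and it needs Python 3.6+ for the f-string. Note that `re.search` stops at the first match, while `re.findall` returns every match, which is what you want here.

```python
import re

my_date = "February 4 - March 23, 2015"

# Month alternation, split over two lines for readability.
months = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)
# A single capturing group: either a month name or a year from 2000 to 2019.
pattern = re.compile(rf"({months}|20[01]\d)")

# search() returns only the first match.
print(pattern.search(my_date).groups())  # ('February',)

# findall() scans the whole string and returns every match of the group.
print(pattern.findall(my_date))          # ['February', 'March', '2015']
```

There are no word boundaries in this pattern, so it would also pick up "May" inside a word like "Maybe". Wrapping the group in \b...\b is one way to tighten it if that matters for your input.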

Kaspar Lee
  • This makes perfect sense to me, but for some reason it's returning `("February",)`. What might be causing that? – tblznbits Apr 06 '16 at 16:12
  • @brittenb Are you using the `g`lobal flag? Your RegEx would match the whole thing in one go, so there is no need for the global flag. However this matches each bit individually, so it must be used. – Kaspar Lee Apr 06 '16 at 16:14
  • I was using `re.search`, but that wasn't working. Using `re.findall` fixed the issue. Your answer works as I would hope. Thanks! – tblznbits Apr 06 '16 at 16:15
  • @brittenb No Problem! `;)` – Kaspar Lee Apr 06 '16 at 16:15

Another, more generic solution, if you don't have to search inside a large text (i.e. you only need to handle strings like the sample):

import re

my_date = "February 4 - March 23, 2015"

ss = re.compile(r"[a-zA-Z]+\S|\d{4}")

print(ss.findall(my_date))

output:

['February', 'March', '2015']
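To illustrate the "large text" caveat: this pattern matches any run of two or more letters, not just month names, so extra words in the input end up in the result too. A quick sketch, where the `text` value with a leading "Due" is a made-up example:

```python
import re

loose = re.compile(r"[a-zA-Z]+\S|\d{4}")

# Hypothetical input with an extra word in front of the date range.
text = "Due February 4 - March 23, 2015"
print(loose.findall(text))  # ['Due', 'February', 'March', '2015']
```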
Javier Clavero