
I need to check whether a string is a valid date range. The string can contain months as text and years as numbers, in no specific order (there is no fixed format such as MM-YYYY-DD).

A valid string could be:

February 2016 - March 2019

September 2015 to August 2019

April 2015 to present

September 2018 - present

Invalid string:

George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year

I have already looked at questions such as a) Constructing Regular Expressions to match numeric ranges

b) Regex to match month name followed by year

and many others, but most of the input strings in those questions seem to have the luxury of a fixed pattern for the month and year, which mine don't.

I tried this regex in python:

import re

pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"

st =  "University of Pennsylvania February 2018"

re.search(pat, st)

but that matches both the valid and the invalid strings from my example; I want to exclude the invalid strings from my eventual output.

For the input "University of Pennsylvania February 2018" the expected output is False.

For "February 2018 to Present", the output must be True.

Amith Adiraju
  • See [regex-to-match-month-name-followed-by-year](https://stackoverflow.com/questions/2655476/regex-to-match-month-name-followed-by-year) for a simplified version of this question. – Jeremy Boden Sep 26 '19 at 01:10

2 Answers


Maybe you could narrow down your expression with a simpler one such as:

(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$

or maybe,

(?i)\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?

Test

import re

regex = r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"
string = """
February 2016 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
Feb. 2016 - March 2019
Sept 2015 to Aug. 2019
April 2015 to present
Nov. 2018 - present

Invalid string:
George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year
"""

print(re.findall(regex, string, re.M))

Output

[('20', '16', 'March', '20', '19'), ('20', '15', 'August', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', ''), ('20', '16', 'March', '20', '19'), ('20', '15', 'Aug.', '20', '19'), ('20', '15', 'present', '', ''), ('20', '18', 'present', '', '')]

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. There you can also see how it matches against some sample inputs.
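Since the question asks for a True/False answer per string, the anchored expression above could be wrapped in a small predicate. This is only a sketch: `RANGE_RE` and `looks_like_range` are names made up for illustration, not part of the original answer.

```python
import re

# Anchored pattern copied from the answer above; (?i) makes it case-insensitive.
RANGE_RE = re.compile(
    r"(?i)^\S+\s+(\d{2})?(\d{2})\s*(?:[-_]|to)\s*(present|\S+)\s*(\d{2})?(\d{2})?$"
)


def looks_like_range(text):
    """Hypothetical helper: True if the whole string has the range shape."""
    return RANGE_RE.search(text.strip()) is not None


print(looks_like_range("February 2018 to Present"))                  # True
print(looks_like_range("University of Pennsylvania February 2018"))  # False
# The pattern checks the shape only, not the month names:
print(looks_like_range("Banana 2018 - present"))                     # True
```

Note that this expression accepts any non-whitespace token where a month is expected, so it may be too permissive if the input can contain arbitrary words followed by a year and a separator.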


Emma

This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT):

import re
# for extra complexity, the first line contains two valid ranges
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on a single line would make it very ugly
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?', 'Aug(?:ust)?',
             'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# match a month followed by a year of 2 to 4 digits; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile('{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)


# if you want to extract both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # If you want to do something with the range:
            # all valid ranges are here; there may be 1 or 2, depending on the number of valid ranges in the line
            # every group has 5 elements because there are 5 capturing groups
            # if M2 and Y2 are non-empty then present is empty, and vice versa (because of (?:{RE_DATE}|(present)))
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups here if you want to verify the values further
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)

print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)


# finditer yields only the valid ranges; a line containing two ranges yields two matches
# if you need one result per line, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # if you want to check the ranges
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the text is returned here
print('VALID USING <finditer>: ',  valid_ranges)

Output:

VALID:  ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>:  ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']

I dislike writing a long regular expression in a single string variable; I prefer to break it into parts so I can still understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.
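A hand-written month list is easy to leave incomplete. One possible alternative, sketched here, is to derive the alternatives from the standard library's `calendar.month_name`. The helper `month_alternative` is hypothetical, and the sketch assumes the process runs under the default "C" or an English locale, since `calendar.month_name` is locale-dependent.

```python
import calendar
import re


def month_alternative(full_name):
    """Hypothetical helper: 'February' -> 'Feb(?:ruary)?', 'May' -> 'May'."""
    head, tail = full_name[:3], full_name[3:]
    if not tail:
        return head
    return "{}(?:{})?".format(head, re.escape(tail))


# calendar.month_name[0] is an empty string, so skip it.
# Assumption: English month names (default "C" locale or an English locale).
MONTHS_RE = [month_alternative(name) for name in calendar.month_name[1:]]
RE_MONTH = "({})".format("|".join(MONTHS_RE))

month_pattern = re.compile(RE_MONTH, flags=re.IGNORECASE)
print(len(MONTHS_RE))  # 12
print([m for m in ("Jul", "october", "Sept") if month_pattern.fullmatch(m)])
# ['Jul', 'october']
```

Four-letter abbreviations such as "Sept" are not covered by this scheme and would need an extra alternative if they occur in the input.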

If you want just to extract ranges you can use this:

valid_ranges = re.findall(RE_VALID_RANGE, text)

But this returns the groups [(M1, Y1, M2, Y2, present), ...], not the matched text:

[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
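The code above uses `findall` and `finditer`, which succeed as soon as a range appears anywhere in a line. If the whole string must be a date range, as in the question's True/False examples, one option is `re.fullmatch`. The following is a minimal sketch under that assumption: the names `MONTH`, `DATE`, `RANGE` and `is_date_range` are illustrative, and the separator is restricted to a hyphen or "to" instead of the `.+?` used above.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
# A month followed by a year of 2 to 4 digits, as in RE_DATE above.
DATE = MONTH + r"\s+\d{2,4}"
# Assumption: the only separators are a hyphen or the word "to".
RANGE = re.compile(
    r"{d}\s*(?:-|to)\s*(?:{d}|present)".format(d=DATE),
    flags=re.IGNORECASE,
)


def is_date_range(text):
    """Hypothetical helper: True only if the entire string is a date range."""
    return RANGE.fullmatch(text.strip()) is not None


print(is_date_range("February 2018 to Present"))                  # True
print(is_date_range("University of Pennsylvania February 2018"))  # False
```

Because `fullmatch` must consume the entire string, a line holding two ranges (like the first line of `text`) would be rejected by this helper; `finditer` remains the better fit for extracting ranges from longer text.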
Charif DZ
  • `Jun(?:e)?` = `June?` – mickmackusa Sep 26 '19 at 11:35
  • @CharifDZ I wish I could give you likes in an infinite loop! Thank you for a clear and precise explanation. This is exactly what I wanted. Shows your efficiency of coding. – Amith Adiraju Sep 26 '19 at 16:10
  • @CharifDZ I'd like to make one tiny change (it's totally optional, but I would like to include this edge case as well). I'd like to treat _pals october 2018 to october 2019_ as valid as well. I changed the regex to `'([\w]\s)?{RE_DATE}.+?(?:{RE_DATE}|(present))'` but the above sentence was not treated as valid. Any suggestions? Typically I want to treat `optional_word month1 year1 ( - or to ) month2 year2 | present` as valid. – Amith Adiraju Sep 26 '19 at 16:57
  • What is the sentence exactly? It looks like the others to me. Why is it not matched? Can you post a comment with the sentence? – Charif DZ Sep 26 '19 at 17:34
  • @CharifDZ There was a typo, I already added `october`, it worked, my bad! – Amith Adiraju Sep 26 '19 at 17:46
  • We just forgot to add `october` to `MONTHS_RE`, try `'Oct(?:ober)?'` – Charif DZ Sep 26 '19 at 17:46
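For the `optional_word month1 year1 ( - or to ) month2 year2 | present` shape discussed in the comments, note that `[\w]\s` matches a single word character followed by one whitespace character, so it cannot consume a whole word such as "pals". A sketch of one way to allow at most one leading word, assuming the whole string is checked with `re.fullmatch` (the names `PREFIXED_RANGE` and `is_prefixed_range` are hypothetical):

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DATE = MONTH + r"\s+\d{2,4}"

# (?:\w+\s+)? allows zero or one leading word before the first month.
PREFIXED_RANGE = re.compile(
    r"(?:\w+\s+)?{d}\s*(?:-|to)\s*(?:{d}|present)".format(d=DATE),
    flags=re.IGNORECASE,
)


def is_prefixed_range(text):
    """Hypothetical helper: a date range with at most one word in front."""
    return PREFIXED_RANGE.fullmatch(text.strip()) is not None


print(is_prefixed_range("pals october 2018 to october 2019"))    # True
print(is_prefixed_range("october 2018 to present"))              # True
print(is_prefixed_range("George Mason University august 2019"))  # False
```

With an unanchored `findall` or `search`, the optional prefix makes no difference to whether a line matches, which is why this sketch anchors the pattern to the whole string.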