
I'm looking to match years between 1980 and 2050 in sentences, using a regex.

So far I use:

import re

def within_years(d):
    return re.search('20[0-5][0-9]', d) or re.search('19[89][0-9]', d)

The problem now is that I also match "22015".

So I thought of prepending [^0-9], but then it cannot match a year at the start of a sentence.

Next I tried prepending [ /-]*, but then the separator is only optional, so "22015" still matches.

Some examples:

should_match = ['2015 is a great year', 'best year: 2015']
should_not_match = ['22015 bogus', 'a2015 is not a year']
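
The two failed attempts described above can be reproduced with a small sketch. The names `strict` and `loose` are illustrative only, and merging the two year patterns into one alternation is an assumption made for brevity:

```python
import re

# Attempt 1: require one non-digit character before the year.
strict = re.compile(r'[^0-9](?:20[0-5][0-9]|19[89][0-9])')
# Attempt 2: allow an optional run of separators before the year.
loose = re.compile(r'[ /-]*(?:20[0-5][0-9]|19[89][0-9])')

# [^0-9] has to consume a character, so a year at position 0 is missed.
print(strict.search('2015 is a great year'))   # None
print(strict.search('best year: 2015'))        # matches ' 2015'

# [ /-]* may match zero characters, so the year inside '22015' still matches.
print(loose.search('22015 bogus'))             # matches '2015'
```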
PascalVKooten

3 Answers


You can use a single regular expression:

(19[89][0-9]|20[0-4][0-9]|2050)

You should wrap it in \b word boundaries, though, so that no letter or digit sits directly next to the year:

\b(19[89][0-9]|20[0-4][0-9]|2050)\b

>>> import re
>>> valid_year = re.compile(r'\b(19[89][0-9]|20[0-4][0-9]|2050)\b')
>>> should_match = ['2015 is a great year', 'best year: 2015']
>>> should_not_match = ['22015 bogus', 'a2015 is not a year']
>>> for s in should_match:
        print(valid_year.search(s))

<_sre.SRE_Match object; span=(0, 4), match='2015'>
<_sre.SRE_Match object; span=(11, 15), match='2015'>
>>> for s in should_not_match:
        print(valid_year.search(s))

None
None
poke

You can be mechanical about it and just build a string of exclusive alternatives:

>>> r'\b({})\b'.format('|'.join([str(x) for x in range(1980, 2051)]))
'\\b(1980|1981|1982|1983|1984|1985|1986|1987|1988|1989|1990|1991|1992|1993|1994|1995|1996|1997|1998|1999|2000|2001|2002|2003|2004|2005|2006|2007|2008|2009|2010|2011|2012|2013|2014|2015|2016|2017|2018|2019|2020|2021|2022|2023|2024|2025|2026|2027|2028|2029|2030|2031|2032|2033|2034|2035|2036|2037|2038|2039|2040|2041|2042|2043|2044|2045|2046|2047|2048|2049|2050)\\b'
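
As a rough sketch of how the generated string could be used, the alternation can be compiled once and reused. The helper name `year_range_pattern` is hypothetical, not something from the answer:

```python
import re

def year_range_pattern(first, last):
    # Hypothetical helper: builds a word-bounded alternation of every
    # year from first to last, both inclusive.
    alternatives = '|'.join(str(year) for year in range(first, last + 1))
    return re.compile(r'\b({})\b'.format(alternatives))

valid_year = year_range_pattern(1980, 2050)

for s in ['2015 is a great year', 'best year: 2015',
          '22015 bogus', 'a2015 is not a year']:
    match = valid_year.search(s)
    print(repr(s), '->', match.group(1) if match else None)
```

Changing the range then only means changing the two arguments, at the cost of a longer pattern.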

But personally I would match four digits and compare to the target years as an integer:

import re

def within_years(txt, tgt=(1980, 2050)):
    # True if any standalone four-digit number in the text is in range
    digits = re.findall(r'\b(\d\d\d\d)\b', txt)
    return any(tgt[0] <= int(e) <= tgt[1] for e in digits)

Or:

def within_years0(txt, tgt=(1980, 2050)):
    # checks only the first standalone four-digit number
    digits = re.search(r'\b(\d\d\d\d)\b', txt)
    return bool(digits and tgt[0] <= int(digits.group(1)) <= tgt[1])
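
The two variants differ when an out-of-range four-digit number comes before a valid year. A minimal, self-contained sketch of that difference, using `\d{4}` as shorthand for `\d\d\d\d`:

```python
import re

def within_years(txt, tgt=(1980, 2050)):
    # Looks at every standalone four-digit number in the text.
    digits = re.findall(r'\b(\d{4})\b', txt)
    return any(tgt[0] <= int(e) <= tgt[1] for e in digits)

def within_years0(txt, tgt=(1980, 2050)):
    # Looks at the first standalone four-digit number only.
    m = re.search(r'\b(\d{4})\b', txt)
    return bool(m and tgt[0] <= int(m.group(1)) <= tgt[1])

text = '9999 2015'
print(within_years(text))    # True: 2015 is found after 9999
print(within_years0(text))   # False: only 9999 is checked

# Adapting the range is a matter of passing a different tuple.
print(within_years('best year: 2015', tgt=(1980, 2010)))  # False
```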
dawg
  • +1 This is the best answer. If you are going to use regex for a problem that doesn't really fit regexes, at least be smart about it. – NullUserException Jul 04 '15 at 16:30
  • Hmmm, well I really wish there was an easy way to define a numerical range in regexes, but this doesn't really seem like it will be fast code? – PascalVKooten Jul 04 '15 at 16:32
  • Also, this still matches 22015, I guess because the `\\b` instead of `\b`? Parentheses missing... – PascalVKooten Jul 04 '15 at 16:34
  • It should be `r'\b({})\b'.format(…)` so that the `\b` don’t belong to only the first and the last year. – poke Jul 04 '15 at 16:35
  • Yea, I ran a benchmark, and this is about 40% slower than poke's answer. – PascalVKooten Jul 04 '15 at 16:37
  • The integer method is also slower, but still faster than all these years OR'ed. – PascalVKooten Jul 04 '15 at 16:38
  • Though it of course has to be said that the integer method is nice in the sense that it is easy to adapt the years! – PascalVKooten Jul 04 '15 at 16:39
  • Two minor points: (1) One disadvantage of this method is that since the verification happens after the fact, if you have "9999 2015", it'll find the 9999 and return False, missing the 2015. To avoid this I usually do a `findall` instead. (2) `if something: return True else: return False` is just `return bool(something)` (with or without the `bool` depending on whether it's already one.) – DSM Jul 04 '15 at 16:59
  • @PascalvKooten: Did you include in the benchmark the time to debug a faster regex or change it for different years? – dawg Jul 04 '15 at 17:12

You can simply use word boundaries (\b).

return re.search(r'\b(?:2050|20[0-4][0-9]|19[89][0-9])\b', d)
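
For context, the line above could sit in a complete function along these lines. This is a sketch: the module-level name `YEAR_RE` and the boolean return value are assumptions, since the question's function returns the match object itself:

```python
import re

# Precompiled once; YEAR_RE is an illustrative name.
YEAR_RE = re.compile(r'\b(?:2050|20[0-4][0-9]|19[89][0-9])\b')

def within_years(d):
    # True if the text contains a standalone year from 1980 to 2050.
    return YEAR_RE.search(d) is not None

for s in ['2015 is a great year', 'due in 2050', 'due in 2059',
          '22015 bogus', 'a2015 is not a year']:
    print(repr(s), '->', within_years(s))
```

Listing 2050 as its own alternative keeps 2051 to 2059 out of the range.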
Avinash Raj
  • Doesn't that also validate "2059" as input within range? I mean, it could be fixed, but just goes to show how silly it is to use regex for this kind of validation. Then someone wants to change the range to 1980-2055 and it's a bunch of code changes instead of changing one number. – NullUserException Jul 04 '15 at 16:21
  • edited. if he wants to match `2050` then `r'\b(?:2050|20[0-4][0-9]|19[89][0-9])\b'` – Avinash Raj Jul 04 '15 at 16:23