
I don't understand why the following two code blocks give different results for the test case. The first code block returns 2016 as expected, but when I make a small change, the second one returns None.

Here's the first one, which returns '2016' as expected:

import re
date = '24 Jan 2016'
def func(line):
    month_regex = re.search('(\d{1,2})\s(Jan)\s(\d{2,4})', line)
    if month_regex:
        year = month_regex.group(3)
    return year
func(date)

Then, I add "(uary)?" to the regex and for some reason, it returns None. Note that the results of group(1) and group(2) work the same in both cases.

import re
date = '24 Jan 2016'
def func(line):
    month_regex = re.search('(\d{1,2})\s(Jan(uary)?)\s(\d{2,4})', line)
    if month_regex:
        year = month_regex.group(3)
    return year
func(date)

Why does the second code block return None?

jss367
  • The first snippet returns the year. See https://ideone.com/1v7FQe. The second snippet [returns `None`](https://ideone.com/BWIkcq), because the 3rd group did not match. What is your final goal? Return the year? Then [use `.group(4)` in the second snippet](https://ideone.com/6qT6Jr). – Wiktor Stribiżew Aug 04 '17 at 14:11
  • Why doesn't the third group match? Wouldn't \d{2,4} match 2016? – jss367 Aug 04 '17 at 14:15
  • But `(uary)?` *is* a **capturing** group, and it is assigned ID=3 (since it is the third capturing group in the pattern). `(\d{2,4})` is the 4th group. – Wiktor Stribiżew Aug 04 '17 at 14:16
  • Ah, I meant to make it a non-capturing group. Thanks! Go ahead and add that as an answer and I'll select it. – jss367 Aug 04 '17 at 14:17
  • Refining your patterns with an online regular expression tester really helps. There are several that use Python flavored regexes. – wwii Aug 04 '17 at 14:20

1 Answer


Since the second regex contains an extra (uary)? capturing group, the match result now contains 4 groups, and .group(3) no longer maps to the year but to the optional uary substring. Since the input has no uary after Jan, that group did not participate in the match, so its value is None.
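
A quick way to see this is to print every group of the match. The following is a minimal sketch using the question's input; the variable names `match` and `all_groups` are only illustrative:

```python
import re

date = '24 Jan 2016'

# Same pattern as the second snippet, written as a raw string.
match = re.search(r'(\d{1,2})\s(Jan(uary)?)\s(\d{2,4})', date)

# groups() lists all capturing groups, numbered by their opening parenthesis.
all_groups = match.groups()
print(all_groups)      # ('24', 'Jan', None, '2016')
print(match.group(3))  # None: the optional (uary)? group did not participate
print(match.group(4))  # 2016
```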

The best way to add optional groups while keeping the group structure intact is via non-capturing groups:

month_regex = re.search(r'(\d{1,2})\s(Jan(?:uary)?)\s(\d{2,4})', line)
                                         ^^ 

See the Python demo online.

Here, the groups are the following:

(\d{1,2})\s(Jan(?:uary)?)\s(\d{2,4})
|-- 1 --|  |---- 2 -----|  |-- 3 --|
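
As a rough check of that numbering, the sketch below runs the non-capturing pattern against both the short and the long month name. The second test string, '24 January 2016', is an assumed input that does not appear in the question:

```python
import re

pattern = r'(\d{1,2})\s(Jan(?:uary)?)\s(\d{2,4})'

short = re.search(pattern, '24 Jan 2016')
full = re.search(pattern, '24 January 2016')  # assumed extra test input

# (?:uary)? still matches optionally, but it does not get a group number.
print(short.groups())  # ('24', 'Jan', '2016')
print(full.groups())   # ('24', 'January', '2016')
print(full.group(3))   # 2016
```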

Besides, it is best practice to use raw string literals for regex patterns, to avoid surprises when you add word boundaries (\b) or backreferences (\1), which a regular string literal would interpret as escape sequences before the regex engine sees them.
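
To illustrate the raw-string point, here is a small sketch with a word boundary. The `\bJan\b` pattern is an assumed example, not one taken from the question:

```python
import re

# In a regular string literal, '\b' is a single backspace character;
# in a raw string, r'\b' is a backslash followed by the letter b.
print(len('\b'), len(r'\b'))  # 1 2

text = '24 Jan 2016'

# Without the r prefix, the pattern looks for literal backspaces and finds none.
plain = re.search('\bJan\b', text)
print(plain)        # None

# With the r prefix, \b reaches the regex engine as a word boundary.
raw = re.search(r'\bJan\b', text)
print(raw.group())  # Jan
```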

Wiktor Stribiżew