I want to use Python's re module to search for some text in a big data set. Maybe I got something wrong, but I thought that with the pattern

(\w+,\s?)+ 

I would get a match for something like:

This, is, a, test,

Why isn't this the case in Python?

The following example only works with [] instead of ():

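import re

# Named `text` rather than `str` to avoid shadowing the built-in type.
text = 'St. aureus°, unimportant_stuff, Strep. haemol.°'

will_fail = re.compile(r'(\w+\.?\s?)+°')  # repeated capturing group
success = re.compile(r'[\w+\.?\s?]+°')    # character class, not a group

print(will_fail.findall(text))
print(success.findall(text))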

This will result in the output:

['aureus', 'haemol.']                 // THIS FAILS
['St. aureus°', ' Strep. haemol.°']   // THIS IS OK

What am I doing wrong here?

petezurich
Hannsen
  • Use a non-capturing group `(?: )` instead of a capturing group `( )`. – Aran-Fey Apr 05 '19 at 19:01
  • Thank you very much! Is there a special reason why Python is doing this? Is it possible to disable the capturing groups? – Hannsen Apr 05 '19 at 19:06
  • Also, I hope you realize that `[...]` is completely different from `(...)`. Those two regexes match entirely different things. – Aran-Fey Apr 05 '19 at 19:06
  • It's a special "feature" of `re.findall`. There's no way to turn it off. There's an alternative to `findall` though: [`re.finditer`](https://docs.python.org/3/library/re.html#re.finditer), which returns an iterator that yields match objects. – Aran-Fey Apr 05 '19 at 19:07
  • Who +1ed this? @Hannsen to explain: in `will_fail`, wrapping the pattern in `( )` tells the regex engine to capture that group specifically. If `( )` is absent, `findall` returns everything the pattern matches by default. Additionally, by putting your `success` pattern inside `[ ]` you are telling the regex to match, in any order, any character that is `\w` OR `+` OR `.` OR `?` OR `\s`, so any run of those characters followed by `°` gets matched. – FailSafe Apr 05 '19 at 19:10
  • Yes, I realized that these are two different things, but only the `[...]` version worked for me in this short sample, so I was really confused when I tested just the lines above. – Hannsen Apr 05 '19 at 19:11
  • @Hannsen Maybe you should read the documentation of [`findall`](https://docs.python.org/3/library/re.html#re.findall): "If one or more groups are present in the pattern, return a list of groups". Parentheses `()` define a [capturing group](https://www.regular-expressions.info/brackets.html) in the pattern. Square brackets `[]` have a different meaning. This is already explained [here](https://stackoverflow.com/a/31915134/6238076). – gdlmx Apr 05 '19 at 19:15
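
To illustrate the points made in the comments, here is a minimal sketch using the string from the question. The variable names (`capturing`, `non_capturing`, `whole_matches`, `char_class`) and the extra test string `'a+?? .°'` are made up for illustration. It shows the behaviour of `re.findall` and `re.finditer` only, and is not meant as the one right way to write this search.

```python
import re

text = 'St. aureus°, unimportant_stuff, Strep. haemol.°'

# Capturing group: findall returns only the group's content, and for a
# repeated group that is the text captured by its last repetition.
capturing = re.compile(r'(\w+\.?\s?)+°')
print(capturing.findall(text))       # ['aureus', 'haemol.']

# Non-capturing group (?:...): there is no group to return, so findall
# falls back to returning the whole match.
non_capturing = re.compile(r'(?:\w+\.?\s?)+°')
print(non_capturing.findall(text))   # ['St. aureus°', 'Strep. haemol.°']

# finditer yields match objects; group(0) is always the whole match,
# even when the pattern contains capturing groups.
whole_matches = [m.group(0) for m in capturing.finditer(text)]
print(whole_matches)                 # ['St. aureus°', 'Strep. haemol.°']

# A character class matches single characters from a set. Inside [],
# '+' and '?' are literal characters, not quantifiers.
char_class = re.compile(r'[\w+\.?\s?]+°')
print(char_class.findall('a+?? .°'))  # ['a+?? .°']
```

The character-class pattern only appeared to work in the question because every character before each `°` happened to be a word character, a dot, or whitespace. It would just as happily accept runs of `+` or `?`, which the grouped pattern would not.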
