1

I was trying to use a regex to check the validity of an IPv6 address. I started with a simplified case: matching four colon-separated blocks of four lowercase hex letters (a-f) each.

For example:

The string - acbe:abfe:aaee:afec

I first used the following regex, which works fine:

Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile("[a-f]{4}:[a-f]{4}:[a-f]{4}:[a-f]{4}")
>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe:abfe:aaee:afec']

Since the pattern repeats, I then tried a shorter regex:

>>> r = re.compile("([a-f]{4}:){3}[a-f]{4}")
>>> r.findall(s)
['aaee:']

Note that only part of the address is returned. When tested on the regex testing website RegexPal, the same pattern matches the full address.

Why isn't the whole address matched? Doesn't Python support grouping in complex regexes?

outis
Kartik Anand

3 Answers

2

You need to change your compile line to:

r = re.compile("(?:[a-f]{4}:){3}[a-f]{4}")

When a regex contains capturing groups, findall returns the groups instead of the entire match. Here the group matches three times, and only the text from its last repetition (the third block) is returned.

Adding ?: right after the opening parenthesis makes it a non-capturing group. This lets you group the sub-pattern for repetition without findall capturing it. Since there are now no capturing groups, findall returns the entire match.

Edit: It appears to work here in python 2.6:

>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe:abfe:aaee:afec']
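
To put the two patterns side by side, here is a minimal sketch. The variable names are only for illustration. It also shows one alternative that is not part of the answer above: if you want to keep the capturing group, finditer yields match objects, and group(0) on a match object is always the whole match.

```python
import re

s = "acbe:abfe:aaee:afec"

# Capturing group: findall returns only the last repetition of the group.
capturing = re.compile(r"([a-f]{4}:){3}[a-f]{4}")
# Non-capturing group: findall returns the whole match.
non_capturing = re.compile(r"(?:[a-f]{4}:){3}[a-f]{4}")

capturing_result = capturing.findall(s)          # ['aaee:']
non_capturing_result = non_capturing.findall(s)  # ['acbe:abfe:aaee:afec']

# finditer yields match objects, so group(0) gives the whole match
# even though the pattern contains a capturing group.
whole_matches = [m.group(0) for m in capturing.finditer(s)]

print(capturing_result)
print(non_capturing_result)
print(whole_matches)
```
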
Corley Brigman
  • Whoops that was my downvote, I misunderstood what OP was trying to do, and it's too late to remove my vote! – Adam Smith Mar 11 '14 at 16:59
  • Does this happen only when I use brackets to specify a group? So, it makes the whole regex one big group right? – Kartik Anand Mar 11 '14 at 17:03
  • it's parentheses that define groups... brackets/braces don't cause this behaviour. – Corley Brigman Mar 11 '14 at 17:07
  • Sorry, I meant parentheses :) – Kartik Anand Mar 11 '14 at 17:12
  • Yes.. i've been bitten by this before. Whenever you put in parentheses for groups, you change the behaviour, unless you also put in `?:`. There are cases where you might actually want to extract multiple pieces of information out of one string; groups let you do that. But if you're not trying to do that, you want to use the `?:`. – Corley Brigman Mar 11 '14 at 17:18
  • Exactly so! If you have a set of lines containing address data, e.g. `John Doe 1234 Anywhere Street Apt 42 Anytown,OR 55555` and you only need the city and state, you can do `r = re.compile(r"(\w+),([A-Z]{2}) \d{5}")` and `r.findall(...)` will return `[('Anytown','OR')]`, which you can iterate with `for city, state in r.findall(addresslists):` – Adam Smith Mar 11 '14 at 17:37
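
A runnable variant of the city/state example from the last comment, as a sketch only. The parentheses in the pattern are balanced here, and the loop unpacks each tuple directly. The sample line is the one from the comment; the names `line` and `pairs` are made up for illustration.

```python
import re

# Sample line quoted in the comment above.
line = "John Doe 1234 Anywhere Street Apt 42 Anytown,OR 55555"

# Two capturing groups: city (word characters) and state (two capitals).
# The ZIP code must be present but is not captured.
r = re.compile(r"(\w+),([A-Z]{2}) \d{5}")

# With more than one group, findall returns a list of tuples.
pairs = r.findall(line)

for city, state in pairs:
    print("%s, %s" % (city, state))
```
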
1

I'm assuming you're trying to get each four-letter string? You want the findall to return ['acbe','abfe','aaee','afec']?

>>> r = re.compile(r"[a-f]{4}(?=:)|(?<=:)[a-f]{4}")
>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe', 'abfe', 'aaee', 'afec']
Adam Smith
1

In "[a-f]{4}:[a-f]{4}:[a-f]{4}:[a-f]{4}" there is no group defined, so re.findall() returns group 0 of every match it finds, that is to say the entire matches.

In "([a-f]{4}:){3}[a-f]{4}" there is one group defined, so re.findall() returns the portion of each match that corresponds to this group. But since this group is repeated, only its last occurrence in each overall match is returned.

Putting ?: just after the opening paren of the group makes it a non-capturing group, so re.findall() again returns the entire matches.
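
As a rough illustration of these cases (the shorter two-block patterns and the variable names below are chosen only for the demo), the shape of the list returned by re.findall() depends on how many capturing groups the pattern has:

```python
import re

s = "acbe:abfe:aaee:afec"

# No group: each item is the entire match (group 0).
no_group = re.findall(r"[a-f]{4}:[a-f]{4}", s)
# ['acbe:abfe', 'aaee:afec']

# One group: each item is only the text of that group.
one_group = re.findall(r"([a-f]{4}):[a-f]{4}", s)
# ['acbe', 'aaee']

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-f]{4}):([a-f]{4})", s)
# [('acbe', 'abfe'), ('aaee', 'afec')]

# One repeated group: only its last repetition is kept per match.
repeated = re.findall(r"([a-f]{4}:){3}[a-f]{4}", s)
# ['aaee:']

# Non-capturing group: back to the entire match.
non_capturing = re.findall(r"(?:[a-f]{4}:){3}[a-f]{4}", s)
# ['acbe:abfe:aaee:afec']
```
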

eyquem
  • Does this happen only when I use brackets to specify a group? So, it makes the whole regex one big group right? – Kartik Anand Mar 11 '14 at 17:01
  • What is **"this"** in the sentence _"Does this happen only when I use brackets..."_ ?? - In the doc of ``re.findall``, it is written: _**"If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."**_ – eyquem Mar 11 '14 at 17:04
  • What do you mean by "there is one group defined", you mean the part inside brackets right?, and since that is repeated it is returning only the last group? – Kartik Anand Mar 11 '14 at 17:09
  • In fact, the entire match is always ``group(0)`` aka ``group()``, whether there are parens or not. When there are parens, ``re.findall`` returns the groups other than ``group(0)``; when there are none, ``re.findall`` returns the entire matches = ``group(0)`` – eyquem Mar 11 '14 at 17:16
  • Yes, a group is defined by two parentheses: ``(.....)`` defines a capturing group, ``(?:.....)`` defines a non-capturing group. Note that according to wikipedia : _"Used unqualified, brackets refer to different types of brackets in different parts of the world and in different contexts."_ and ``(`` and ``)`` are parentheses, ``{`` and ``}`` are curly brackets, ``[`` and ``]`` are square brackets. – eyquem Mar 11 '14 at 17:29
  • Yes i know the stuff about brackets..i just forgot to use the correct word..thanks :) – Kartik Anand Mar 11 '14 at 17:31