
Given a text, I need to check, for each character, whether it has exactly 3 capital letters on both sides; if it does, I add it to a string of such characters, which is returned.

I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text) (let's say text="AAAbAAAcAAA")

I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"

Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet when I invoke m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?

Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?

APerson
user1413824

2 Answers


You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:

([A-Z]{3}).[A-Z]{3}

The exception is m.group(0), which will always contain the entire match.
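As a minimal sketch of that difference, using the sample text from the question (the variable names `m` and `m2` are only illustrative):

```python
import re

text = "AAAbAAAcAAA"

# No parentheses in the pattern, so there are no capture groups.
m = re.match(r"[A-Z]{3}.[A-Z]{3}", text)
print(m.group(0))   # the entire match: AAAbAAA
print(m.groups())   # () because no groups were defined

# Parentheses around the first triplet define group 1.
m2 = re.match(r"([A-Z]{3}).[A-Z]{3}", text)
print(m2.group(0))  # still the entire match: AAAbAAA
print(m2.group(1))  # AAA
print(m2.groups())  # ('AAA',)
```

Calling `m.group(1)` on the first match object raises `IndexError`, which is the "no such group" error described in the question.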

Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like

([0-9]{3})-([0-9]{3}-[0-9]{4})

then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
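A short sketch of that phone-number pattern in Python; the number itself is made up for illustration:

```python
import re

phone = "555-867-5309"  # hypothetical sample number

m = re.match(r"([0-9]{3})-([0-9]{3}-[0-9]{4})", phone)
area_code = m.group(1)    # first parenthesized part
local_part = m.group(2)   # second parenthesized part
entire = m.group(0)       # the whole match

print(area_code, local_part, entire)
print(m.groups())         # only the numbered groups, without group 0
```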

What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.
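One common way to get overlapping matches in Python is to wrap the whole pattern in a lookahead with a capture group inside it, so that each match consumes no characters. This is a sketch of that technique under the question's sample text, and not necessarily the exact approach of the linked answer:

```python
import re

text = "AAAbAAAcAAA"

# Ordinary search: matches cannot overlap, so the second triplet is used up
# by the first match and "AAAcAAA" is never found.
plain = re.findall(r"[A-Z]{3}.[A-Z]{3}", text)

# Lookahead with a capture group: the match itself is empty, so the engine
# can try again at every position and report overlapping candidates.
overlapping = [m.group(1) for m in re.finditer(r"(?=([A-Z]{3}.[A-Z]{3}))", text)]

print(plain)        # ['AAAbAAA']
print(overlapping)  # ['AAAbAAA', 'AAAcAAA']
```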

Community
Justin Morgan - On strike
  • Ohh, I just love this site, thanks. So each pair of parentheses defines a group? And what about parentheses such as (?=...), that is, with a question mark? And I still don't know why my regex doesn't work. – user1413824 May 25 '12 at 17:30
  • (?=) is a positive lookahead. It means that the engine will look forward in the string to determine a match without consuming the characters it inspects. – Silas Ray May 25 '12 at 17:31
  • They suggest using finditer, but its documentation says: "Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string". It doesn't help me; I even tried it. – user1413824 May 25 '12 at 18:00

One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
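A sketch of that call on the question's sample text, joined into the string the question asks for. The `strict` pattern is an added assumption about what "exactly 3" should mean (it rejects runs of four or more capitals) and is not part of the original answer:

```python
import re

search_string = "AAAbAAAcAAA"

# The pattern from the answer: only the lowercase letter is captured, and the
# trailing triplet is checked by a lookahead so it can be reused by the next match.
found = re.findall(r"[A-Z]{3}([a-z])(?=[A-Z]{3})", search_string)
result = "".join(found)
print(found, result)  # ['b', 'c'] bc

# Hypothetical stricter variant: a lookbehind and a nested negative lookahead
# reject a fourth capital on either side.
strict = r"(?<![A-Z])[A-Z]{3}([a-z])(?=[A-Z]{3}(?![A-Z]))"
print(re.findall(strict, search_string))  # ['b', 'c']
print(re.findall(strict, "AAAAbAAA"))     # []
```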

Silas Ray
  • Thanks, I see it works. Why isn't the left expression [A-Z]{3} surrounded with parentheses? When I surround it with parentheses I get no matches. Why? – user1413824 May 25 '12 at 18:03
  • Not sure why you get no matches when you put it in parens... but it's not in parens because it's not a match group or a look ahead or look behind. – Silas Ray May 25 '12 at 18:42
  • So why is the last one in parens? Can you explain all the parens in this regex? It's really important for me to understand. – user1413824 May 25 '12 at 18:47
  • There are parens surrounding the arguments to `findall`, then the parens in `([a-z])` define that as a capture group, and the ones in `(?=[A-Z]{3})` define the bounds of the lookahead term. – Silas Ray May 25 '12 at 18:49
  • My problem is with understanding the lookahead. I read this on regular-expressions.info: for (?=regex), the explanation is: "Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends." I really don't understand this explanation, or the other explanations of this subject in general. Also, what do "consume" and "expand" mean in that context? – user1413824 May 25 '12 at 18:56
  • The regex engine, when you perform any type of search on a string, iterates over the characters in the string. Normally, when it iterates over a character, it treats the character as having been searched over and will not match against it again. What the lookahead does is make the engine inspect the characters without consuming them, thus making the engine reprocess them after the match is either made or failed. Not expanding the match means that even though the pattern won't be treated as matched unless the lookahead is found, the lookahead won't be contained in the match group. – Silas Ray May 25 '12 at 19:00
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/11758/discussion-between-sr2222-and-user1413824) – Silas Ray May 25 '12 at 19:02
  • Oops, didn't mean to click that... but yeah, I would reiterate @Justin Morgan's suggestion to read through everything at http://www.regular-expressions.info/tutorial.html. – Silas Ray May 25 '12 at 19:03
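To make "consume" and "expand" in the comments above concrete, here is a small sketch comparing a pattern that consumes the trailing triplet with one that only looks ahead at it. The variable names are illustrative; the last line uses the one(?=two)three pattern quoted from the tutorial:

```python
import re

text = "AAAbAAAcAAA"

# Trailing triplet is part of the match: it is consumed.
consuming = re.search(r"[A-Z]{3}([a-z])[A-Z]{3}", text)
# Trailing triplet is only inspected: the match stops before it.
peeking = re.search(r"[A-Z]{3}([a-z])(?=[A-Z]{3})", text)

print(consuming.group(0), consuming.end())  # AAAbAAA 7
print(peeking.group(0), peeking.end())      # AAAb 4

# Because the lookahead leaves the middle "AAA" unconsumed, the next match
# can start there, and both letters are found.
print(re.findall(r"[A-Z]{3}([a-z])[A-Z]{3}", text))      # ['b']
print(re.findall(r"[A-Z]{3}([a-z])(?=[A-Z]{3})", text))  # ['b', 'c']

# After "one", both "two" (lookahead) and "three" would have to match at the
# same position, which is impossible, so there is no match.
print(re.search(r"one(?=two)three", "onetwothree"))      # None
```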