105

I am putting together a fairly complex regular expression. One part of the expression matches strings such as '+a' or '-57': a + or a - followed by one or more letters or digits. I want to match zero or more strings that fit this pattern.

This is the expression I came up with:

([\+-][a-zA-Z0-9]+)*

If I were to search the string '-56+a' using this pattern I would expect to get two matches:

+a and -56

However, I only get the last match returned:

>>> import re
>>> m = re.match(r"([\+-][a-zA-Z0-9]+)*", '-56+a')
>>> m.groups()
('+a',)

Looking at the Python docs I see that:

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

So, my question is: how do you access multiple group matches?

Tom Scrace

2 Answers

85

Drop the * from your regex (so it matches exactly one instance of your pattern). Then use either re.findall(...) or re.finditer(...) to return all matches.
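
As a minimal sketch of that suggestion, here is the question's pattern with the trailing * and the capturing group removed, applied to the sample string '-56+a'. The names TOKEN, text, tokens and spans are illustrative only, not part of the original question.

```python
import re

# The question's pattern without the trailing `*`: one sign plus one or more
# letters/digits per match. (TOKEN is a hypothetical name for this sketch.)
TOKEN = re.compile(r"[+-][a-zA-Z0-9]+")

text = "-56+a"

# findall returns every non-overlapping match as a string.
tokens = TOKEN.findall(text)

# finditer yields match objects, so the position of each match is available too.
spans = [(m.group(0), m.start()) for m in TOKEN.finditer(text)]

print(tokens)  # ['-56', '+a']
print(spans)   # [('-56', 0), ('+a', 3)]
```

If the pattern contains exactly one capturing group, findall returns the group's text rather than the whole match, which is why the group is dropped here.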

It sounds like you're essentially building a recursive descent parser. For relatively simple parsing tasks, it is quite common and entirely reasonable to do that by hand. If you're interested in a library solution (in case your parsing task may become more complicated later on, for example), have a look at pyparsing.
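
When position matters and the tokens must be contiguous, a hand-written loop can anchor each match at the end of the previous one instead of scanning the whole string. The following is only a sketch under that assumption; tokenize and TOKEN are hypothetical names, and the sample input is made up for illustration.

```python
import re

# Same single-token pattern as in the question, minus the trailing `*`.
TOKEN = re.compile(r"[+-][a-zA-Z0-9]+")

def tokenize(text, pos=0):
    """Collect contiguous tokens starting at pos; return (tokens, end position)."""
    tokens = []
    while True:
        # Pattern.match(text, pos) only succeeds if a token starts exactly at pos.
        m = TOKEN.match(text, pos)
        if m is None:
            break
        tokens.append(m.group(0))
        pos = m.end()
    return tokens, pos

result = tokenize("-56+a rest")
print(result)  # (['-56', '+a'], 5)
```

The returned end position lets the rest of a larger tokeniser carry on from where this run of tokens stopped.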

phooji
  • Thanks for your response. The problem is that the expression in my question is just part of a much longer expression. I am trying to tokenise a string entered by the user. I think I may have to take a 'divide and conquer' approach, break off the part of the string that will contain the groups identified by this part of the expression and then apply re.findall as you suggest. Thanks again for your help! – Tom Scrace Feb 20 '11 at 23:17
  • It might be worth noting that re.findall(pattern, string) will find all occurrences of pattern within the string, even if such occurrences are non-contiguous. That is: re.findall('a.', 'axayaz') == re.findall('a.', '--ax---ay-----az-------') == ['ax', 'ay', 'az'] – Adeel Zafar Soomro Feb 20 '11 at 23:22
  • Yes, unfortunately in my case the position within the string is relevant. '+a' in one part of the string could mean something totally different in another part. Thanks. – Tom Scrace Feb 20 '11 at 23:30
  • @Tom: I've added some more high-level links to the answer. If this answers your question, please upvote the answer (green check mark / uparrow) to mark this question as resolved. – phooji Feb 20 '11 at 23:32
  • Thanks for those links phooji - very interesting. I have marked your answer as accepted. Thanks everyone! – Tom Scrace Feb 20 '11 at 23:40
50

The third-party regex module (installable from PyPI) fixes this by adding a .captures method:

>>> import regex
>>> m = regex.match(r"(..)+", "a1b2c3")
>>> m.captures(1)
['a1', 'b2', 'c3']
Eric
  • This answer solved my problem more simply than the currently accepted answer. The `regex` module is also supposed to replace the Python `re` module in the future. – Rubinous May 17 '16 at 11:07
  • What is this dark magic! – Kwame Aug 27 '20 at 20:56