How to determine which match belongs to which pattern?

Question

I have multiple patterns that need to be extracted from the text below.

import re

s = 'This text2 11contains 4 numbers and 6 words.'

For example, assuming I need to extract words and numbers separately, I can do:

>>> re.findall(r'[a-zA-Z]+', s)
['This', 'text', 'contains', 'numbers', 'and', 'words']

and

>>> re.findall(r'\d+', s)
['2', '11', '4', '6']

However, if I want to combine both in a single expression, I would do:

>>> re.findall(r'[a-zA-z]+|\d+', s)
['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']

But I need to tell which match belongs to which pattern. In this case I can simply check isnumeric() and group each match accordingly, but once the patterns get any more complicated there is no way of telling unless each pattern is extracted separately, which becomes inefficient when a large number of documents and patterns are involved. What would be a way of obtaining the match types? For the example above it would be something like:

['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']
['word', 'word', 'number', ...]

or a simple enumeration of each group [0, 0, 1, 1, ...].

I'm not sure if I understand you correctly. Perhaps you can enclose a branch in capturing group (e.g. `([a-zA-Z]+)|\d+`) then check if that group was matched (e.g. `if match[1]: ...`)? — InSync, Jul 09 '23 at 17:59
Note that `A-z` is probably not what you want. See [this question](https://stackoverflow.com/q/4923380). — InSync, Jul 09 '23 at 18:00
@InSync failing shift key, I mean `A-Z`. Simply, if I have `patterns = [p1, p2, p3, p4, p5, ...]` and some text, I need to extract all in a single expression and be able to tell which matches in the `findall` results belong to which pattern. — nlblack323, Jul 09 '23 at 18:02
So you think obtaining whether it matched a number or letter is magically handed out by `findall()`? Pretty ridiculous, since the only thing you have to test is whether one group matched or not. You think all the answers below get you out of checking whether a group matched, or whether an element in an array is `word` or `number`, or even if an element has no length? You haven't shown any relationship between the words or numbers. Come on, get real! — sln, Jul 09 '23 at 19:55
Either `re.findall(r'[a-zA-z]+|\d+', s)` or `re.findall(r'[a-zA-z]+', s) re.findall(r'\d+', s)` or `re.findall(r'([a-zA-z]+)|(\d+)', s)` — sln, Jul 09 '23 at 19:59
If you use `re.findall(r'[a-zA-z]+|\d+', s)` you can sort the array. — sln, Jul 09 '23 at 20:00

score 3 · Accepted Answer · answered Jul 09 '23 at 18:09

Use named group matching:

import re

s = 'This text2 11contains 4 numbers and 6 words.'
# Each alternative is a named group; m.lastgroup is the name of the group that matched.
matches_it = re.finditer(r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)', s)
res = [(m.group(), m.lastgroup) for m in matches_it]

[('This', 'word'), ('text', 'word'), ('2', 'number'), ('11', 'number'), ('contains', 'word'), ('4', 'number'), ('numbers', 'word'), ('and', 'word'), ('6', 'number'), ('words', 'word')]
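
If the patterns come as a collection rather than being written out by hand, the same idea can be built programmatically. The following is only a sketch: the `patterns` dict and the `tag_matches` helper are names made up for illustration, and it assumes every key is a valid group name (a Python identifier) and that no sub-pattern defines named groups of its own.

```python
import re

# Hypothetical pattern table: name -> regex source.
patterns = {
    'word': r'[a-zA-Z]+',
    'number': r'\d+',
}

# Wrap each pattern in its own named group and join the groups with '|'.
combined = re.compile(
    '|'.join(f'(?P<{name}>{pat})' for name, pat in patterns.items())
)

def tag_matches(text):
    """Return (matched_text, pattern_name) pairs in order of appearance."""
    return [(m.group(), m.lastgroup) for m in combined.finditer(text)]

s = 'This text2 11contains 4 numbers and 6 words.'
print(tag_matches(s))
```

Alternatives are tried left to right, so when two patterns can match at the same position, the one listed first in the dict wins.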

score 2 · Answer 2 · answered Jul 09 '23 at 18:05

You can use capturing groups, such as named groups, together with re.finditer() and re.Match.groupdict():

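import re

s = 'This text2 11contains 4 numbers and 6 words.'
regex = r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)'
matches = re.finditer(regex, s)
for m in matches:
    # groupdict() maps every group name to its text, or None if it did not take part in the match
    for k, v in m.groupdict().items():
        if v is not None:
            print(k, v)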

word This
word text
number 2
number 11
word contains
number 4
word numbers
word and
number 6
word words
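
If you would rather end up with one list per pattern than a stream of printed pairs, the matches can be bucketed by group name in a single pass. A minimal sketch, assuming the same two named groups as above; `by_kind` is just an illustrative name:

```python
import re
from collections import defaultdict

s = 'This text2 11contains 4 numbers and 6 words.'
regex = re.compile(r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)')

# Map each group name to the list of strings it matched, in order.
by_kind = defaultdict(list)
for m in regex.finditer(s):
    by_kind[m.lastgroup].append(m.group())

print(dict(by_kind))
```

This gives the same lists as running one findall per pattern, but the text is scanned only once.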

score 2 · Answer 3 · answered Jul 09 '23 at 18:08

Using groups, something like the following will do:

>>> [(w, "word") if w else (n, "number") for w, n in re.findall(r'([a-zA-Z]+)|(\d+)', s)]
[('This', 'word'),
 ('text', 'word'),
 ('2', 'number'),
 ('11', 'number'),
 ('contains', 'word'),
 ('4', 'number'),
 ('numbers', 'word'),
 ('and', 'word'),
 ('6', 'number'),
 ('words', 'word')]
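
For the plain enumeration mentioned in the question ([0, 0, 1, 1, ...]), one option is to read `lastindex` from each match object instead of unpacking the findall tuples. This is a sketch under one assumption: each pattern is wrapped in exactly one capturing group and contains no nested capturing groups, otherwise the group numbers shift.

```python
import re

s = 'This text2 11contains 4 numbers and 6 words.'

# One unnamed capturing group per pattern: group 1 = word, group 2 = number.
matches = list(re.finditer(r'([a-zA-Z]+)|(\d+)', s))

values = [m.group() for m in matches]
# lastindex is the 1-based number of the group that matched; shift it to 0-based.
kinds = [m.lastindex - 1 for m in matches]

print(values)
print(kinds)
```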
