I have multiple patterns that need to be extracted from the text below.

import re

s = 'This text2 11contains 4 numbers and 6 words.'

For example, if I need to extract words and numbers separately, I can do:

>>> re.findall(r'[a-zA-Z]+', s)
['This', 'text', 'contains', 'numbers', 'and', 'words']

and

>>> re.findall(r'\d+', s)
['2', '11', '4', '6']

However, if I want to combine both in a single expression, I would do:

>>> re.findall(r'[a-zA-z]+|\d+', s)
['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']

But I need to tell which match belongs to which pattern. In this case I can simply check isnumeric() and group each match accordingly, but once the patterns get more complicated there is no way of telling unless each pattern is extracted separately, which becomes inefficient when a large number of documents and patterns are involved. What would be a way of obtaining the match types? For the example above it would be something like:

['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']
['word', 'word', 'number', ...] 

or a simple enumeration of each group [0, 0, 1, 1, ...].

nlblack323
  • I'm not sure if I understand you correctly. Perhaps you can enclose a branch in capturing group (e.g. `([a-zA-Z]+)|\d+`) then check if that group was matched (e.g. `if match[1]: ...`)? – InSync Jul 09 '23 at 17:59
  • Note that `A-z` is probably not what you want. See [this question](https://stackoverflow.com/q/4923380). – InSync Jul 09 '23 at 18:00
  • @InSync That was a failing shift key; I meant `A-Z`. Simply put, if I have `patterns = [p1, p2, p3, p4, p5, ...]` and some text, I need to extract all in a single expression and be able to tell which matches in the `findall` results belong to which pattern. – nlblack323 Jul 09 '23 at 18:02
  • So you think obtaining whether it matched a number or letter is magically handed out by `findall()`? Pretty ridiculous, since the only thing you have to test is whether one group matched or not. You think all the answers below get you out of checking whether a group matched, or whether an element in an array is `word` or `number`, or even if an element has no length? You haven't shown any relationship between the words or numbers. Come on, get real! – sln Jul 09 '23 at 19:55
  • Either `re.findall(r'[a-zA-z]+|\d+', s)` or `re.findall(r'[a-zA-z]+', s) re.findall(r'\d+', s)` or `re.findall(r'([a-zA-z]+)|(\d+)', s)` – sln Jul 09 '23 at 19:59
  • If you use `re.findall(r'[a-zA-z]+|\d+', s)` you can sort the array. – sln Jul 09 '23 at 20:00

3 Answers

Use named group matching:

import re

s = 'This text2 11contains 4 numbers and 6 words.'
# m.lastgroup is the name of the named group that matched
matches_it = re.finditer(r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)', s)
res = [(m.group(), m.lastgroup) for m in matches_it]
print(res)

[('This', 'word'), ('text', 'word'), ('2', 'number'), ('11', 'number'), ('contains', 'word'), ('4', 'number'), ('numbers', 'word'), ('and', 'word'), ('6', 'number'), ('words', 'word')]
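
The comments under the question mention a list of patterns (`patterns = [p1, p2, ...]`). As a rough sketch of how the named-group idea might scale to that case, the combined expression can be built from a table of names and sub-patterns. The `patterns` dict and the `tag_matches` helper below are hypothetical names chosen for illustration, and the sketch assumes the sub-patterns contain no capturing groups of their own (use `(?:...)` inside them).

```python
import re

# Hypothetical pattern table; names and sub-patterns are illustrative only.
patterns = {
    "word": r"[a-zA-Z]+",
    "number": r"\d+",
}

# Wrap each sub-pattern in a named group and join the groups with alternation.
combined = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in patterns.items())
)

def tag_matches(text):
    """Return (matched_text, pattern_name) pairs in order of appearance."""
    return [(m.group(), m.lastgroup) for m in combined.finditer(text)]

s = 'This text2 11contains 4 numbers and 6 words.'
print(tag_matches(s))
```

Alternation tries the branches left to right, so when two sub-patterns can match at the same position, the one listed first in the table wins.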
RomanPerekhrest

You can use capturing groups, such as named groups. For example, with re.finditer() and re.Match.groupdict():

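import re

s = 'This text2 11contains 4 numbers and 6 words.'
regex = r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)'
matches = re.finditer(regex, s)
for m in matches:
    # groupdict() maps every group name to its text, or None if it did not match
    for k, v in m.groupdict().items():
        if v is not None:
            print(k, v)

word This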
word text
number 2
number 11
word contains
number 4
word numbers
word and
number 6
word words
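
The question also mentions grouping the matches by type. A minimal sketch of that, assuming the same two named groups as above, is to collect each match under the name of the group that produced it. The `group_by_type` helper is a hypothetical name, not part of the `re` module.

```python
import re
from collections import defaultdict

regex = re.compile(r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)')

def group_by_type(text):
    """Collect matched strings into lists keyed by the matching group's name."""
    grouped = defaultdict(list)
    for m in regex.finditer(text):
        # lastgroup is the name of the named group that matched
        grouped[m.lastgroup].append(m.group())
    return dict(grouped)

s = 'This text2 11contains 4 numbers and 6 words.'
print(group_by_type(s))
# {'word': ['This', 'text', 'contains', 'numbers', 'and', 'words'], 'number': ['2', '11', '4', '6']}
```

This keeps a single pass over the text, which is the point of combining the patterns in the first place.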
wjandrea

Using groups, something like the following will do:

>>> [(w, "word") if w else (n, "number") for w, n in re.findall(r'([a-zA-Z]+)|(\d+)', s)]
[('This', 'word'),
 ('text', 'word'),
 ('2', 'number'),
 ('11', 'number'),
 ('contains', 'word'),
 ('4', 'number'),
 ('numbers', 'word'),
 ('and', 'word'),
 ('6', 'number'),
 ('words', 'word')]
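
For the plain enumeration the question asks about (`[0, 0, 1, 1, ...]`), one possible sketch uses `Match.lastindex`, which holds the 1-based index of the capturing group that matched. This assumes each alternative is wrapped in exactly one capturing group with no nested capturing groups, so the group index lines up with the pattern's position. The `match_type_indices` helper is a hypothetical name.

```python
import re

# One unnamed capturing group per alternative: group 1 = word, group 2 = number.
regex = re.compile(r'([a-zA-Z]+)|(\d+)')

def match_type_indices(text):
    """Return the matched strings and the 0-based index of the pattern that matched each."""
    matches = list(regex.finditer(text))
    values = [m.group() for m in matches]
    # lastindex is 1-based, so subtract 1 to get a 0-based pattern index
    indices = [m.lastindex - 1 for m in matches]
    return values, indices

s = 'This text2 11contains 4 numbers and 6 words.'
values, indices = match_type_indices(s)
print(values)
print(indices)
# [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
```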
user2390182