I have multiple patterns that need to be extracted from the text below:
import re
s = 'This text2 11contains 4 numbers and 6 words.'
For example, assuming I need to extract words and numbers separately, I can do:
>>> re.findall(r'[a-zA-Z]+', s)
['This', 'text', 'contains', 'numbers', 'and', 'words']
and
>>> re.findall(r'\d+', s)
['2', '11', '4', '6']
However, if I want to combine both in a single expression, I would do:
>>> re.findall(r'[a-zA-Z]+|\d+', s)
['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']
But I need to tell which match belongs to which pattern. In this case I can simply check isnumeric() and group each match accordingly, but once the patterns get any more complicated there is no way of telling unless each pattern is extracted separately, which becomes inefficient when a large number of documents and patterns are involved.
What would be a way of obtaining the match types? For the example above it would be something like:
['This', 'text', '2', '11', 'contains', '4', 'numbers', 'and', '6', 'words']
['word', 'word', 'number', ...]
or a simple enumeration of each group: [0, 0, 1, 1, ...].
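To make the desired output concrete, here is a rough sketch of one direction that might work: wrap each alternative in a named group and read the match object's lastgroup (or lastindex) attribute from re.finditer. The group names word and number are placeholders I made up for this example, and I have not checked how well this holds up once the individual patterns contain capturing groups of their own.

```python
import re

s = 'This text2 11contains 4 numbers and 6 words.'

# Each alternative is wrapped in its own named group.
# 'word' and 'number' are illustrative names, not anything special to re.
pattern = re.compile(r'(?P<word>[a-zA-Z]+)|(?P<number>\d+)')

matches = []  # the matched text
kinds = []    # name of the group that produced each match
indices = []  # 0-based position of that group in the pattern

for m in pattern.finditer(s):
    matches.append(m.group())
    kinds.append(m.lastgroup)        # name of the last matched capturing group
    indices.append(m.lastindex - 1)  # lastindex is 1-based

print(matches)
print(kinds)
print(indices)
```

This scans the text only once, regardless of how many alternatives are combined. If a sub-pattern needs internal grouping, using non-capturing groups (?:...) inside it should keep lastgroup pointing at the outer named group.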