Python regex findall

Question

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags. Here is my attempt:

regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?

unutbu · Accepted Answer · 2011-10-13T10:32:59.450

import re
regex = ur"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)

yields

['Barack Obama', 'Bill Gates']

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
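
One way to check this equivalence is sketched below. It assumes Python 3, where the ur prefix no longer exists and an ordinary string literal decodes \u escapes; the variable names are only illustrative.

```python
# Assumes Python 3: ordinary str literals decode \uXXXX escapes.
escaped = "[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
plain = "[[1P].+?[/P]]+?"

# The escapes are just another spelling of '[', ']' and '/'.
print(escaped == plain)  # True
```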

The first bracketed group [[1P] tells re that any one of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,

Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
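
The steps above can be sketched as follows. This is an illustrative walk-through rather than the answerer's own code: it assumes Python 3 (so the ur prefix becomes r), and the variable names are made up for the example.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# A character class matches ONE character from the set, not the sequence,
# which is why the bracketed attempt goes wrong.
class_hits = re.findall(r"[\[1P]", "[P]")
print(class_hits)   # ['[', 'P']

# Outer brackets removed, literal brackets escaped, and a capturing
# group placed around the lazy .+? so only the inner text is returned.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)
print(names)        # ['Barack Obama', 'Bill Gates']
```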

score 16 · Answer 2 · answered Oct 13 '11 at 10:21

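Try this:

import re
subject = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match.start() is where the match begins; match.end() is exclusive.
    # match.group() is the whole match; match.group(1) is the text inside the tags.
    print(match.group(1).strip())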

answered Oct 13 '11 at 10:21

FailedDev

I really like this answer. If you want to process only the matches, then this does it without any extra statements like 1) save the list, 2) process the list. Isn't that equivalent to the following?

    str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
    ## Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)  ## ['alice@google.com', 'bob@abc.com']
    for email in emails:
        # do something with each found email string
        print email

– kkron Aug 13 '14 at 23:10
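
To make the comparison in this comment concrete, here is a small sketch that processes the same matches both ways. It assumes Python 3 and reuses the sample line from the question; the variable names are illustrative. re.findall builds the whole list up front, while re.finditer yields match objects one at a time.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")
pattern = r"\[P[^\]]*\](.*?)\[/P\]"

# findall: a list of the captured strings, built up front.
from_findall = [text.strip() for text in re.findall(pattern, line)]

# finditer: lazy iteration over match objects (group(1) is the capture).
from_finditer = [m.group(1).strip() for m in re.finditer(pattern, line)]

print(from_findall)                    # ['Barack Obama', 'Bill Gates']
print(from_findall == from_finditer)   # True
```

The results are the same; finditer is mainly useful when the positions (match.start(), match.end()) are also needed.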

score 4 · Answer 3 · answered Oct 13 '11 at 10:24

Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:

>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']

score 2 · Answer 4 · answered Jul 18 '16 at 06:16

Use this pattern:

pattern = r'\[P\].+?\[/P\]'

answered Jul 18 '16 at 06:16

Sohn

This is a duplicate answer (adds nothing from the current top answer), but also, incorrect. It will match but not capture anything (there is no capture group) - it doesn't answer the question, which is to use re.findall to get the matched text. – LightCC Aug 08 '22 at 01:53
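
For reference on the capture-group point: re.findall returns the whole match when the pattern contains no group, and the group's text when it contains exactly one. A short sketch, assuming Python 3 and the question's sample line (variable names are illustrative):

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# No capturing group: findall returns each full match, tags included.
with_tags = re.findall(r"\[P\].+?\[/P\]", line)
print(with_tags)     # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']

# One capturing group: findall returns only what the group matched.
without_tags = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)
print(without_tags)  # ['Barack Obama', 'Bill Gates']
```

So the group-less pattern yields the first of the two output forms the question asks for, and adding a group yields the second.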

score 2 · Answer 5 · edited Oct 13 '11 at 12:41

You can replace your pattern with:

regex = ur"\[P\]([\w\s]+)\[\/P\]"

edited Oct 13 '11 at 12:41

Chris Morgan

answered Oct 13 '11 at 10:31

pram

Take care with your formatting; *use the preview region*. Because you didn't format it properly, the backslashes were guzzled (markdown is poor like that). – Chris Morgan Oct 13 '11 at 12:43
Why do you do `[\w\s]+` rather than `.*?` which is what he used? Seems to me `.*?` is more likely to be what he wants, anyway. `[\w\s]` is horribly limiting. – Chris Morgan Oct 13 '11 at 12:44
The limitation is intentional. I use [\w\s]+ because apparently the asker wants to extract names, which rarely contain numbers. Also note that the asker wanted to extract words, not numbers. Just my opinion though, cmiiw – pram Oct 18 '11 at 11:32
What about names with such interesting features as accents? `not re.match('\w', u'é')`. If the names are arbitrary, you should not discount the possibility of non-Latin names. – Chris Morgan Oct 18 '11 at 22:57
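
The accent concern can be checked directly. The expression in the comment reflects Python 2, where \w is ASCII-only unless re.UNICODE is passed. The sketch below assumes Python 3 instead, where str patterns are Unicode-aware by default and re.ASCII restores the narrower behaviour; the sample names are made up for illustration.

```python
import re

# Hypothetical samples; accents are written as escapes to avoid
# any ambiguity about how the characters are encoded.
accented = "[P]Jos\u00e9 N\u00fa\u00f1ez[/P]"
apostrophe = "[P]Mary O'Neil[/P]"
pattern = r"\[P\]([\w\s]+)\[/P\]"

# Python 3 default: \w matches accented letters in str patterns.
unicode_hits = re.findall(pattern, accented)
print(unicode_hits)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the name is missed.
ascii_hits = re.findall(pattern, accented, re.ASCII)
print(ascii_hits)       # []

# Punctuation inside a name defeats [\w\s]+ in either mode.
apostrophe_hits = re.findall(pattern, apostrophe)
print(apostrophe_hits)  # []
```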
