48

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags. Here is my attempt:

import re
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?

pb2q
Ignatius

5 Answers

73
import re
# Escaped brackets match the literal tags; the group captures the text between them.
regex = ur"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)

yields

['Barack Obama', 'Bill Gates']

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
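That equivalence can be checked with a small sketch. It assumes Python 3, where an ordinary string literal processes \u escapes (in Python 2.7 the ur"" prefix does the same), and it silences the warning that newer Python 3 releases may raise for a pattern starting with "[[".

```python
import re
import warnings

# \u005B is "[", \u005D is "]" and \u002F is "/", so both spellings
# produce exactly the same pattern text.
escaped = "[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
plain = "[[1P].+?[/P]]+?"
print(escaped == plain)

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Newer Python 3 releases may emit a FutureWarning ("possible nested set")
# for the leading "[["; ignore it for this demonstration only.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = re.findall(plain, line)

# Same misbehaving matches as in the question; only the spelling changed.
print(result)
```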

The first bracketed group [[1P] tells re to match any one of the characters '[', '1', or 'P', and the second bracketed group [/P] likewise matches a single '/' or 'P' (the ] that follows it is then just a literal bracket). That's not what you want at all. So,

  • Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
  • To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
  • To return only the words inside the tags, place grouping parentheses around .+?.
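The difference these fixes make can be seen in a minimal sketch, assuming the sample line from the question. The variable names are only illustrative, and plain r"" strings are used so the patterns read the same in Python 2.7 and Python 3.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Unescaped, [P] is a character class: it matches each single letter "P",
# including the one in "President".
as_class = re.findall(r"[P]", line)

# Escaped, \[P\] matches only the literal three-character tag.
as_literal = re.findall(r"\[P\]", line)

# With one capture group, findall returns just the group's text.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(as_class)
print(as_literal)
print(names)
```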
unutbu
16

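Try this:

    import re
    # subject is the string to search; [^\]]* also allows attributes inside the opening tag
    for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
        start = match.start()    # match start
        end = match.end()        # match end (exclusive)
        text = match.group()     # matched text, tags included
        inner = match.group(1)   # text between the tags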
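As a runnable sketch of the same loop, assuming the question's sample line as the subject (the spans list is just an illustrative way to collect results), each match object gives both the position and the captured text:

```python
import re

subject = ("President [P] Barack Obama [/P] met Microsoft founder "
           "[P] Bill Gates [/P], yesterday.")

spans = []
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # group(1) keeps the spaces next to the tags, so strip them here.
    inner = match.group(1).strip()
    spans.append((match.start(), match.end(), inner))
    print(match.start(), match.end(), inner)
```

Slicing subject[start:end] with a stored pair returns the full tagged text, which is handy when the match has to be replaced or highlighted in place.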
FailedDev
  • 1
    I really like this answer. If you want to process only the matches, this does it without extra statements such as 1) save the list, 2) process the list. Isn't that equivalent to the following, where `re.findall()` returns a list of all the found email strings? `str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'`, then `emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)`, which gives `['alice@google.com', 'bob@abc.com']`, then `for email in emails: print email` to do something with each found email string. – kkron Aug 13 '14 at 23:10
4

Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:

>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']
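The \s? on each side of the group makes the space next to a tag optional. A short sketch with a made-up input (not from the question) that mixes both spacings:

```python
import re

pattern = re.compile(r"\[P\]\s?(.+?)\s?\[/P\]")

# Hypothetical input: the first tag pair has no padding, the second has one space.
mixed = "[P]Barack Obama[/P] met [P] Bill Gates [/P]."
names = pattern.findall(mixed)
print(names)
```

Only a single whitespace character is absorbed on each side, so wider padding would stay in the captured text unless \s? is widened to \s*.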
Blair
2

Use this pattern:

pattern = r'\[P\].+?\[/P\]'

Check here
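A quick sketch of how this pattern behaves with re.findall, assuming the question's sample line: because the pattern has no capture group, findall returns the whole matched text, tags included. The slicing step is only an illustration of how to drop the tags afterwards.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

pattern = r"\[P\].+?\[/P\]"

# No capture group, so each result is the full match.
tagged = re.findall(pattern, line)
print(tagged)

# "[P]" is 3 characters and "[/P]" is 4, so slicing removes the tags.
inner = [t[3:-4].strip() for t in tagged]
print(inner)
```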

Sohn
  • This is a duplicate answer (adds nothing from the current top answer), but also, incorrect. It will match but not capture anything (there is no capture group) - it doesn't answer the question, which is to use re.findall to get the matched text. – LightCC Aug 08 '22 at 01:53
2

You can replace your pattern with:

regex = ur"\[P\]([\w\s]+)\[\/P\]"
Chris Morgan
pram
  • Take care with your formatting; *use the preview region*. Because you didn't format it properly, the backslashes were guzzled (markdown is poor like that). – Chris Morgan Oct 13 '11 at 12:43
  • Why do you do `[\w\s]+` rather than `.*?` which is what he used? Seems to me `.*?` is more likely to be what he wants, anyway. `[\w\s]` is horribly limiting. – Chris Morgan Oct 13 '11 at 12:44
  • The limitation is intentional. I use [\w\s]+ because the asker apparently wants to extract names, which rarely contain numbers. Also note that the asker wanted to extract words, not numbers. Just my opinion though, cmiiw – pram Oct 18 '11 at 11:32
  • 2
    What about names with such interesting features as accents? `not re.match('\w', u'é')`. If the names are arbitrary, you should not discount the possibility of non-Latin names. – Chris Morgan Oct 18 '11 at 22:57
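The concern about accents can be checked with a short sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII restricts \w to [a-zA-Z0-9_], which is how \w behaves in Python 2.7 when no flag is given. In 2.7 the corresponding fix would be passing re.UNICODE together with unicode strings. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample; \u00e9 is "é" and \u00ed is "í".
line = "[P] Jos\u00e9 Mart\u00ed [/P] met [P] Bill Gates [/P]."

pattern = r"\[P\]\s?([\w\s]+?)\s?\[/P\]"

# ASCII-only \w (the Python 2.7 default): the accented name is skipped.
ascii_only = re.findall(pattern, line, re.ASCII)

# Unicode-aware \w (the Python 3 default): both names are found.
unicode_aware = re.findall(pattern, line)

print(ascii_only)
print(unicode_aware)
```

Even with Unicode-aware matching, [\w\s] still rejects names containing hyphens or apostrophes, which is the trade-off against .+? raised in the comments.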