I have a string like this:

<TEST>foo bar</TEST>

I want to use a regex to extract the foo bar.

I'm currently using this, but it's not working:

typesRegex = re.compile('<\w+>(\w+)<\w+>')
typesRegex.match(testStr)

Why?

praks5432
  • If you plan on parsing HTML with regular expressions beyond this example, please [read this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and google why it's a bad idea if you still think you should do it. – msvalkon Feb 14 '14 at 08:12
  • @msvalkon: That's very true, though that post is not the most helpful. I think [this one](http://stackoverflow.com/a/1758162) from the same question is more informative, if not as famous. – icktoofay Feb 14 '14 at 08:13
  • @icktoofay there are however plenty of differing answers in that question, which I think make it a good resource. Fixed the link to point to the question and not the answer. – msvalkon Feb 14 '14 at 08:16
  • I wasn't planning on parsing html with regexes – praks5432 Feb 14 '14 at 17:52

2 Answers

Because \w does not match whitespace, and foo bar contains a space.

Also, </TEST> contains a /, which \w does not match either, and your pattern has no / for the closing tag.

>>> re.match(r'<\w+>([\w\s]+)</\w+>', '<TEST>foo bar</TEST>')
<_sre.SRE_Match object at 0x0000000002AFDBE8>
>>> _.groups()
('foo bar',)
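
As a rough side-by-side sketch of the two points above (the variable names here are illustrative and not from the question), the original pattern returns no match while the corrected one captures the text:

```python
import re

test_str = '<TEST>foo bar</TEST>'

# Original pattern: \w+ cannot cross the space in "foo bar",
# and nothing in the pattern accounts for the "/" in the closing tag.
original = re.compile(r'<\w+>(\w+)<\w+>')

# Corrected pattern: allow whitespace in the body and a literal "/" in the closing tag.
fixed = re.compile(r'<\w+>([\w\s]+)</\w+>')

original_match = original.match(test_str)
fixed_match = fixed.match(test_str)

print(original_match)        # None
print(fixed_match.group(1))  # foo bar
```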
falsetru

This regex is more flexible and conceptually simpler: "capture everything up to the next opening angle bracket".

>>> import re                     
>>> r = re.compile(r'<test>([^<]*)</test>', re.I)
>>> r.search("<TEST>foo bar </test>").group(1)
'foo bar '
>>> r.search("<TEST>Charles Camille Saint-Saens</test>").group(1)
'Charles Camille Saint-Saens'

Note that \w matches none of -, %, @, and so on.
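
A small sketch of that limitation, assuming a few made-up sample strings: a \w+ capture fails as soon as the text contains one of those characters, while [^<]* still captures the whole text.

```python
import re

# Same tag pattern with two different capture groups.
word_only = re.compile(r'<test>(\w+)</test>', re.I)
until_bracket = re.compile(r'<test>([^<]*)</test>', re.I)

# Hypothetical samples containing -, % and @.
samples = ['<test>Saint-Saens</test>', '<test>100%</test>', '<test>a@b</test>']

# \w+ stops at the first non-word character, so </test> never lines up.
word_results = [word_only.search(s) for s in samples]

# [^<]* accepts any character except "<", so the full text is captured.
bracket_results = [until_bracket.search(s).group(1) for s in samples]

print(word_results)     # [None, None, None]
print(bracket_results)  # ['Saint-Saens', '100%', 'a@b']
```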

You have already received several comments discouraging you from parsing HTML yourself, but I posted this answer in the hope that it conveys the idea of applying a finite-state machine to parsing text. HTH

nodakai