Python regex extract strings between <>

Question

So I have a string like this

<TEST>foo bar</TEST>

I want to use a regex to extract the foo bar.

I'm using this currently, but it's not working

typesRegex = re.compile('<\w+>(\w+)<\w+>')
typesRegex.match(testStr)

why?

If you plan on parsing HTML with regular expressions besides this example, please [read this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and google why it's a bad idea if you still think you should do it. — msvalkon, Feb 14 '14 at 08:12
@msvalkon: That's very true, though that post is not the most helpful. I think [this one](http://stackoverflow.com/a/1758162) from the same question is more informative, if not famous. — icktoofay, Feb 14 '14 at 08:13
@icktoofay there are however plenty of differing answers in that question, which I think make it a good resource. Fixed the link to point to the question and not the answer. — msvalkon, Feb 14 '14 at 08:16

score 1 · Accepted Answer · answered Feb 14 '14 at 08:05


Because \w does not match whitespace.

foo bar contains a space.

Also, </TEST> contains a /, which \w does not match either, so the closing tag needs a literal / in the pattern.

>>> import re
>>> re.match(r'<\w+>([\w\s]+)</\w+>', '<TEST>foo bar</TEST>')
<_sre.SRE_Match object at 0x0000000002AFDBE8>
>>> _.groups()
('foo bar',)
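
To see the two failure points side by side, here is a minimal sketch that runs the pattern from the question and the corrected one against the same input. The variable names are only illustrative.

```python
import re

test_str = '<TEST>foo bar</TEST>'

# Pattern from the question: (\w+) cannot cross the space in "foo bar",
# and <\w+> cannot match "</TEST>" because "/" is not a word character.
original = re.compile(r'<\w+>(\w+)<\w+>')
print(original.match(test_str))  # None

# Corrected: allow whitespace inside the group and a literal "/" in the closing tag.
fixed = re.compile(r'<\w+>([\w\s]+)</\w+>')
m = fixed.match(test_str)
print(m.group(1))  # foo bar
```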

answered Feb 14 '14 at 08:05

falsetru


oh so do I need to add a kleene star? (\w+\s+)*? – praks5432 Feb 14 '14 at 08:05
how would I match the foo bar? – praks5432 Feb 14 '14 at 08:06
ah ok will this work for something like this bladi blah foo bar ? – praks5432 Feb 14 '14 at 08:07
@praks5432, Do you mean text with leading, trailing spaces? Then, yes it will. – falsetru Feb 14 '14 at 08:08
@praks5432, BTW do you know the difference between `re.search()` and `re.match()`? If you don't, please read [search() vs match()](http://docs.python.org/2/library/re.html#search-vs-match) – falsetru Feb 14 '14 at 08:09
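
For readers following the last comment, a small sketch of the difference: `re.match()` only tries to match at the start of the string, while `re.search()` scans forward for the first match. The input string with a leading "prefix " is an assumption made up for illustration.

```python
import re

pattern = re.compile(r'<\w+>([\w\s]+)</\w+>')
text = 'prefix <TEST>foo bar</TEST>'  # hypothetical input with text before the tag

# match() anchors at position 0, so the leading "prefix " makes it fail.
print(pattern.match(text))  # None

# search() scans the string and finds the tag further in.
found = pattern.search(text)
print(found.group(1))  # foo bar
```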

score 0 · Answer 2 · answered Feb 14 '14 at 08:25

This regex is more flexible and conceptually simple: "until the beginning of the next opening angle bracket"

>>> import re
>>> r = re.compile(r'<test>([^<]*)</test>', re.I)
>>> r.search("<TEST>foo bar </test>").group(1)
'foo bar '
>>> r.search("<TEST>Charles Camille Saint-Saens</test>").group(1)
'Charles Camille Saint-Saens'

Note that \w matches none of -, %, @ and similar punctuation characters.

http://docs.python.org/2/library/re.html#regular-expression-syntax
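
A short sketch of what that means in practice, reusing the hyphenated name from the example above. The names `strict` and `loose` are illustrative only.

```python
import re

wrapped = '<TEST>Charles Camille Saint-Saens</TEST>'

# \w matches none of these punctuation characters.
print(re.findall(r'\w', '-%@'))  # []

# [\w\s]+ stops at the hyphen, so the whole pattern fails to match.
strict = re.compile(r'<\w+>([\w\s]+)</\w+>')
print(strict.search(wrapped))  # None

# [^<]* accepts any character except "<", so the hyphen is not a problem.
loose = re.compile(r'<\w+>([^<]*)</\w+>')
print(loose.search(wrapped).group(1))  # Charles Camille Saint-Saens
```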

You have already received several comments discouraging you from parsing HTML yourself. I posted this answer anyway in the hope that it conveys how a finite state machine applies to parsing text. HTH
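
If the input really is HTML, one alternative to a regex is the standard library's parser. The following is only a sketch, assuming Python 3 (where the module is `html.parser`); the `TextCollector` class is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text found inside a given tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lower case
        self.depth = 0          # how many matching tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside the tag of interest.
        if self.depth > 0:
            self.chunks.append(data)


collector = TextCollector('test')
collector.feed('<TEST>foo bar</TEST>')
collector.close()
text = ''.join(collector.chunks)
print(text)  # foo bar
```')
other.close()
assert ''.join(other.chunks) == 'Saint-Saens'
</test>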
