I have a string like this:
<TEST>foo bar</TEST>
I want to use a regex to extract foo bar.
I'm currently using this, but it's not working:
typesRegex = re.compile('<\w+>(\w+)<\w+>')
typesRegex.match(testStr)
Why?
Because \w does not match a space, and foo bar contains a space.
Also, </TEST> contains /, which \w does not match either.
>>> re.match(r'<\w+>([\w\s]+)</\w+>', '<TEST>foo bar</TEST>')
<_sre.SRE_Match object at 0x0000000002AFDBE8>
>>> _.groups()
('foo bar',)
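To make the two separate problems visible, here is a small sketch that runs the pattern from the question, a variant that only allows whitespace, and the fully corrected pattern against the same string. The variable names (test_str, original, space_only, fixed) are just illustrative.

```python
import re

test_str = "<TEST>foo bar</TEST>"

# Pattern from the question: \w+ stops at the space, and the "/" in the
# closing tag is never accounted for.
original = re.compile(r"<\w+>(\w+)<\w+>")

# Allowing whitespace in the group is not enough on its own, because
# "</TEST>" still contains a "/" that \w+ cannot match.
space_only = re.compile(r"<\w+>([\w\s]+)<\w+>")

# Both fixes together: whitespace in the group and a literal "/" in the
# closing tag.
fixed = re.compile(r"<\w+>([\w\s]+)</\w+>")

original_result = original.match(test_str)
space_only_result = space_only.match(test_str)
fixed_result = fixed.match(test_str)

print(original_result)        # None
print(space_only_result)      # None
print(fixed_result.group(1))  # foo bar
```

Note that match() returns None rather than raising when nothing matches, so checking the result before calling group() is a good habit.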
This regex is more flexible and conceptually simple: it captures everything up to the next opening angle bracket.
>>> import re
>>> r = re.compile(r'<test>([^<]*)</test>', re.I)
>>> r.search("<TEST>foo bar </test>").group(1)
'foo bar '
>>> r.search("<TEST>Charles Camille Saint-Saens</test>").group(1)
'Charles Camille Saint-Saens'
Note that \w matches none of -, %, @, and so on.
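As a rough illustration of that point, the sketch below compares a \w-and-whitespace character class with the negated class [^<] on the hyphenated name used earlier. The pattern names (word_and_space, up_to_bracket) are made up for this example, and the generic <\w+> tags are an assumption borrowed from the question rather than part of the regex above.

```python
import re

text = "<TEST>Charles Camille Saint-Saens</TEST>"

# \w matches letters, digits and underscore only, so none of these match.
punctuation_hits = re.findall(r"\w", "-%@")

# Word characters and whitespace only: the "-" in the name breaks the match.
word_and_space = re.compile(r"<\w+>([\w\s]+)</\w+>")

# Anything except "<": the hyphen is captured like any other character.
up_to_bracket = re.compile(r"<\w+>([^<]*)</\w+>")

word_result = word_and_space.search(text)
bracket_result = up_to_bracket.search(text)

print(punctuation_hits)         # []
print(word_result)              # None
print(bracket_result.group(1))  # Charles Camille Saint-Saens
```

The negated class avoids having to list every character that may appear in the content; the only character it has to rule out is the one that ends the content.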
You have already received many comments discouraging you from parsing HTML yourself, but I posted this answer in the hope that it conveys the idea of applying a finite state machine to parsing text. HTH