I have an HTML file:
...<b>Breakfast</b><hr>...
I want Breakfast, which is between > and <.
I tried:
...for test_string in line:
    if re.match(r'(>.*<$)',test_string):...
That didn't give >Breakfast< either.
Thank you.
In general, regular expressions can't parse HTML. You could use an HTML parser instead:
from bs4 import BeautifulSoup  # pip install beautifulsoup4
html = """...<b>Breakfast</b><hr>..."""
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all(string=True))  # get all text
# -> ['...', 'Breakfast', '...']
print([b.text for b in soup.find_all('b')])  # get all text for <b> tags
# -> ['Breakfast']
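If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch and not part of the answer above: the names BoldTextCollector and bold_texts are made up for illustration, and it assumes you only care about text inside <b> tags.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Hypothetical helper: collects text found inside <b>...</b>."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of <b> tags currently open
        self.texts = []   # text chunks seen while inside a <b> tag

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.texts.append(data)


def bold_texts(html):
    """Return the list of text chunks that appear inside <b> tags."""
    parser = BoldTextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.texts


print(bold_texts("...<b>Breakfast</b><hr>..."))  # -> ['Breakfast']
```

Unlike a regex, this keeps working when the tag carries attributes or when other tags sit on the same line.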
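The $ means "end of input" and doesn't belong in this regex. In addition, re.match only matches at the start of the string, so a pattern that begins with > can never match a line that begins with something else.
Instead, use re.search and drop the $: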
m = re.search(r'>([^<]*)<', test_string)
if m:
    print(m.group(1))
This searches for >, then all the following characters that are not <, and then <. The characters between > and < are captured as a group, which you get using m.group(1).
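To make the difference concrete, here is a small sketch that runs the question's pattern and the corrected one against the sample line. The variable name line is an assumption; the sample text is the one from the question.

```python
import re

line = "...<b>Breakfast</b><hr>..."

# re.match only tries at position 0, and the line starts with '.', not '>'.
with_match = re.match(r'(>.*<$)', line)
print(with_match)                 # -> None

# re.search scans the whole line, but '$' still requires '<' to be the
# last character, which is not the case here.
with_dollar = re.search(r'(>.*<$)', line)
print(with_dollar)                # -> None

# Without '$', and with a character class that stops at the next '<'.
m = re.search(r'>([^<]*)<', line)
first = m.group(1) if m else None
print(first)                      # -> Breakfast

# re.findall returns group 1 of every non-overlapping match, including
# the empty string between '</b>' and '<hr>'.
all_groups = re.findall(r'>([^<]*)<', line)
print(all_groups)                 # -> ['Breakfast', '']
```

Filtering out the empty strings from the findall result is left to the caller.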
I think you want:
r'(>.*?<)'
Or maybe:
r'<b(>.*?<)/b>'
Both are non-greedy and, when used with re.search rather than re.match, match in the middle of a string. Note that parsing HTML with regular expressions is not very robust.
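As a rough illustration of what "non-greedy" changes, the sketch below compares the greedy and non-greedy forms on the sample line from the question. The variable names are assumptions, and the last pattern is a variation that moves the group inside the delimiters, not something the answer proposed.

```python
import re

line = "...<b>Breakfast</b><hr>..."

# Greedy: '.*' runs on to the LAST '<' in the line.
greedy = re.search(r'(>.*<)', line).group(1)
print(greedy)       # -> >Breakfast</b><

# Non-greedy: '.*?' stops at the FIRST '<' after the '>'.
lazy = re.search(r'(>.*?<)', line).group(1)
print(lazy)         # -> >Breakfast<

# Tied to the <b> tag; the group still includes the '>' and '<'.
tagged = re.search(r'<b(>.*?<)/b>', line).group(1)
print(tagged)       # -> >Breakfast<

# Variation: group only the text between the tags.
inner = re.search(r'<b>(.*?)</b>', line).group(1)
print(inner)        # -> Breakfast
```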