What's the regular expression for finding the string between > and <?

Question

I have an HTML file:

 ...<b>Breakfast</b><hr>...

I want to extract Breakfast, which is the text between > and <.

I tried

...for test_string in line:
        if re.match(r'(>.*<$)',test_string):...

That didn't give >Breakfast< either.

Thank you.

possible duplicate of [Whats the regular expression for finding string between " "](http://stackoverflow.com/questions/3066328/whats-the-regular-expression-for-finding-string-between) — Ken White, Jan 22 '12 at 06:46
You should look at something like this: http://www.crummy.com/software/BeautifulSoup/ — Chris Cooper, Jan 22 '12 at 06:46
As usual, with anything involving HTML and Regexes: http://stackoverflow.com/a/1732454/118068 — Marc B, Jan 22 '12 at 06:51
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — outis, Jan 22 '12 at 22:44

score 4 · Answer 1 · answered Jan 22 '12 at 07:04

In general, regular expressions can't parse HTML. You could use an HTML parser instead:

from BeautifulSoup import BeautifulSoup # pip install BeautifulSoup

html = """...<b>Breakfast</b><hr>..."""

soup = BeautifulSoup(html)
print soup(text=True) # get all text
# -> [u'...', u'Breakfast', u'...']
print [b.text for b in soup('b')] # get all text for <b> tags
# -> [u'Breakfast']
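
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration in Python 3 syntax, not part of the answer above; the names BoldTextCollector and bold_texts are made up for this example.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collects the text found inside <b>...</b> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <b> tags currently open
        self.texts = []   # text chunks seen while inside a <b> tag

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while at least one <b> is open.
        if self.depth > 0:
            self.texts.append(data)


def bold_texts(html):
    """Return the list of text chunks found inside <b> tags."""
    parser = BoldTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


print(bold_texts("...<b>Breakfast</b><hr>..."))  # ['Breakfast']
```

The parser tracks tag nesting itself, so the result does not depend on where the > and < characters happen to fall in the markup.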

score 3 · Answer 2 · dorsh · 2012-01-22T06:53:54.093

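The $ means "end of input" and doesn't belong in this regex. Note also that re.match only matches at the beginning of the string, whereas re.search scans the whole string.

Instead, do the following: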

m = re.search(r'>([^<]*)<', test_string)
if m:
    print m.group(1)

This searches for >, then all the following characters that are not <, and then <. The characters between > and < are marked as a group, which you get using m.group(1).
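
As a self-contained check, here is the same pattern run against the sample line from the question, written in Python 3 syntax. The re.findall call with [^<]+ is an added variation, not part of the answer: it collects every non-empty chunk instead of only the first one.

```python
import re

html = "...<b>Breakfast</b><hr>..."

# The attempt from the question: re.match anchors at position 0 and $
# requires the match to end at the end of the string, so nothing matches.
print(re.match(r'(>.*<$)', html))  # None

# re.search scans the whole string; group(1) is the text between > and <.
m = re.search(r'>([^<]*)<', html)
first = m.group(1) if m else None
print(first)  # Breakfast

# Variation: use + instead of * to skip empty chunks such as the one in "><".
all_chunks = re.findall(r'>([^<]+)<', "<b>Breakfast</b><i>Lunch</i>")
print(all_chunks)  # ['Breakfast', 'Lunch']
```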

edited Jan 22 '12 at 06:53

answered Jan 22 '12 at 06:46


How is `[^<]*<` better than `.*?<` in this use case? Surely they are translated to the same code internally. – kindall Jan 22 '12 at 06:54
That's interesting; is one actually any more efficient? – kindall Jan 22 '12 at 07:27

@kindall, yes, the `[^<]*` is slightly more efficient http://pastebin.com/jNArurPw – reclosedev Jan 22 '12 at 07:35
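
The contents of the linked paste are not reproduced here. A rough way to measure the difference yourself might look like the sketch below (Python 3, standard-library timeit). The input string and the repetition count are arbitrary choices, and the timings will vary by machine and Python version.

```python
import re
import timeit

text = "...<b>Breakfast</b><hr>..."

negated = re.compile(r'>([^<]*)<')  # negated character class
lazy = re.compile(r'>(.*?)<')       # non-greedy dot

# On this input both patterns capture the same text.
same_result = negated.search(text).group(1) == lazy.search(text).group(1)
print(same_result)

# Time each compiled pattern on the same input.
for name, pattern in (("[^<]*", negated), (".*?", lazy)):
    seconds = timeit.timeit(lambda: pattern.search(text), number=100_000)
    print(f"{name}: {seconds:.3f}s for 100,000 searches")
```

The two patterns are not fully interchangeable: without re.DOTALL, . does not match a newline, so .*? stops at a line break while [^<]* continues across it.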

score 0 · Answer 3 · edited May 23 '17 at 12:11

I think you want:

r'(>.*?<)'

Or maybe

r'<b(>.*?<)/b>'

which is non-greedy and matches in the middle of a string. Note that parsing HTML with regular expressions is not very robust.
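
To make the greedy versus non-greedy difference concrete, here is a small Python 3 sketch on the sample line from the question. It uses re.search, since re.match would only try the start of the string. Note that the group in these patterns includes the > and < themselves, so they are sliced off at the end.

```python
import re

html = "...<b>Breakfast</b><hr>..."

# Greedy: .* runs as far as possible, then backs up to the last < in the string.
greedy = re.search(r'(>.*<)', html).group(1)

# Non-greedy: .*? stops at the first < it can reach.
lazy = re.search(r'(>.*?<)', html).group(1)

# Same idea, restricted to the contents of a <b> element.
inside_b = re.search(r'<b(>.*?<)/b>', html).group(1)

print(greedy)      # >Breakfast</b><
print(lazy)        # >Breakfast<
print(inside_b)    # >Breakfast<
print(lazy[1:-1])  # Breakfast
```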

answered Jan 22 '12 at 06:46

Cameron
