regex grabbing too much info

Question

My script:

def fetch_online():
    pattern = re.search('(<span class="on">)(.*)(</span>)', data)
    return pattern.group(2)

print fetch_online()

Inside data, there is one line that contains this:

        <b><span><span class="on">5879</span> users online</span></b>

However, when run, the output is this:

5879</span> users online

How should I fix this so it only grabs the data before the first </span>?

Repeat after me: [Do not try to parse HTML with Regular Expressions](http://stackoverflow.com/a/1732454/100297)! — Martijn Pieters, May 26 '12 at 14:09
Depending on the scope of your project, [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a pretty good Python library for handling HTML. — huon, May 26 '12 at 14:12

score 4 · Accepted Answer · answered May 26 '12 at 14:11

In your specific case here, go for (<span class="on">)(\d+). In a more general approach, go for non-greedy:

<span class="on">(.*?)</span>
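
To make the difference concrete, here is a small sketch that runs the greedy pattern from the question next to the two suggestions above. It uses the sample line from the question as `data`; the variable names are only illustrative, and Python 3 syntax is assumed.

```python
import re

# Sample input copied from the question.
data = '<b><span><span class="on">5879</span> users online</span></b>'

# Greedy: .* runs to the end of the line, then backtracks to the LAST </span>.
greedy = re.search(r'<span class="on">(.*)</span>', data).group(1)

# Non-greedy: .*? expands one character at a time and stops at the FIRST </span>.
lazy = re.search(r'<span class="on">(.*?)</span>', data).group(1)

# Digits only: \d+ cannot match '<', so it stops before the closing tag.
digits = re.search(r'<span class="on">(\d+)', data).group(1)

print(greedy)  # 5879</span> users online
print(lazy)    # 5879
print(digits)  # 5879
```

The digits-only version is the strictest of the three: it fails to match (and `re.search` returns `None`) if the span ever contains something other than a number, which may or may not be what you want.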

dda

score 3 · Answer 2 · edited May 23 '17 at 10:09

Use the non-greedy quantifier: (<span class="on">)(.*?)(</span>).

To learn more about the non-greedy quantifier, read the "Laziness Instead of Greediness" section at Regular-Expressions.info.

Just to reiterate what has already been said in the comments, parsing HTML using regular expressions is highly discouraged.
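
As a rough illustration of the parser-based alternative, the sketch below uses `html.parser` from the Python 3 standard library rather than BeautifulSoup, purely so it runs without installing anything. The class name `OnlineCountParser` is hypothetical, and the code assumes the target span contains only text (no nested tags).

```python
from html.parser import HTMLParser


class OnlineCountParser(HTMLParser):
    """Collects the text inside the first <span class="on"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside the target span
        self.value = None     # text found inside the target span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "on") in attrs and self.value is None:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, text):
        if self._inside and self.value is None:
            self.value = text.strip()


# Sample input copied from the question.
data = '<b><span><span class="on">5879</span> users online</span></b>'

parser = OnlineCountParser()
parser.feed(data)
parser.close()
print(parser.value)  # 5879
```

This is more code than a one-line regex, but it does not depend on how the surrounding markup is laid out on the line.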

answered May 26 '12 at 14:09

creemama
