My script:

import re

def fetch_online():
    # data holds the page's HTML and is defined elsewhere in the script
    pattern = re.search('(<span class="on">)(.*)(</span>)', data)
    return pattern.group(2)

print(fetch_online())

Inside data, there is one line that contains this:

        <b><span><span class="on">5879</span> users online</span></b>

However, when run, the output is this:

5879</span> users online

How should I fix this so it only grabs the data before the first </span>?

user1417933

2 Answers

In your specific case, go for <span class="on">(\d+)</span>. For a more general approach, use a non-greedy match:

<span class="on">(.*?)</span>
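
As a rough illustration of the difference, the sketch below runs the greedy pattern, the non-greedy pattern, and the digits-only pattern against the sample line from the question. The variable names are only for illustration, and data here is just that one line rather than the full page.

```python
import re

# Sample line from the question; in the real script, data holds the whole page.
data = '<b><span><span class="on">5879</span> users online</span></b>'

# Greedy: .* runs to the LAST </span> on the line.
greedy = re.search(r'<span class="on">(.*)</span>', data).group(1)

# Non-greedy: .*? stops at the FIRST </span>.
lazy = re.search(r'<span class="on">(.*?)</span>', data).group(1)

# Digits only: \d+ cannot cross the closing tag at all.
digits = re.search(r'<span class="on">(\d+)</span>', data).group(1)

print(greedy)  # 5879</span> users online
print(lazy)    # 5879
print(digits)  # 5879
```

Note that re.search returns None when nothing matches, so a real script would probably want to check the result before calling group().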
dda

Use the non-greedy quantifier: (<span class="on">)(.*?)(</span>).

To learn more about the non-greedy quantifier, read the "Laziness Instead of Greediness" section at Regular-Expressions.info.

Just to reiterate what has already been said in the comments, parsing HTML using regular expressions is highly discouraged.
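
For comparison, here is a minimal sketch of the same lookup using the standard library's HTML parser instead of a regex. It assumes Python 3 (in Python 2 the module is named HTMLParser), and the class name OnlineCountParser is made up for this example.

```python
from html.parser import HTMLParser

class OnlineCountParser(HTMLParser):
    """Hypothetical helper: grabs the text of the first <span class="on">."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside the target span
        self.count = None     # text found, or None if no such span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "on") in attrs and self.count is None:
            self.inside = True

    def handle_data(self, text):
        if self.inside:
            self.count = text.strip()
            self.inside = False

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

# Sample line from the question; the real data would be the whole page.
data = '<b><span><span class="on">5879</span> users online</span></b>'
parser = OnlineCountParser()
parser.feed(data)
print(parser.count)  # 5879
```

This is longer than the regex, but it does not depend on how the tags happen to be laid out on the line.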

creemama