Simple Regex from HTML

Question

I have the following line taken from a webpage's source:

<span>41,396</span>

And the following regex:

("<span>.*</span>")

Which returns

<span>New Users</span>

However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.

More importantly, I need a regex for the following code:

<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>

I was thinking this may involve line breaks and more, which is why I threw this in separately.

It's not a good idea to use regex to parse (X)HTML: see [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for a discussion of why. You'd be better off using an (X)HTML parser and processing your (X)HTML programmatically. Which programming language are you using? – MarcoS, Apr 15 '11 at 13:53
This is in VB.NET using an HttpWebRequest. It returns the result into a list box. The program is fine apart from knowing how to write the regex. I also have no idea what a parser is. – Skeela87, Apr 15 '11 at 14:01
Do not use the greedy dot-star! It will erroneously match _much_ more than you bargained for. (To match a simple SPAN containing only numbers and whitespace, use something like this instead: `"[\d.,\s]+`) _Say what you mean, mean what you say!_ — ridgerunner, Apr 15 '11 at 15:08

score 2 · Accepted Answer · answered Apr 15 '11 at 13:54

You could try something like

<span( class=\".+\")?>(.*)</span>

And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?

If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
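As a rough illustration of what "using an HTML parser" means, here is a minimal sketch in Python (chosen only for illustration; the question is about VB.NET, where a library such as the HTML Agility Pack would play this role). It uses the standard-library `html.parser` module. The class name `SpanTextCollector`, the sample markup, and the `rankColumn` class are assumptions made up for this example, modelled on the snippet in the question.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the non-empty text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.values = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only real text that sits inside at least one <span>.
        if self.depth > 0 and text:
            self.values.append(text)


# Hypothetical, well-formed sample; "rankColumn" is an invented class name.
html = """
<span class="rankColumn">
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
</span>
"""

collector = SpanTextCollector()
collector.feed(html)
values = collector.values
print(values)  # ['41,396', '2,150', '161,305,807']
```

The parser keeps track of nesting itself, so nested `<span>` tags and line breaks need no special handling in a pattern.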

FrustratedWithFormsDesigner

The reason for just using <span> as the regex was just for my own experiment. But I thought if you used the whole second part of the code, it would work fine for grabbing data from a table? – Skeela87 Apr 15 '11 at 14:07
@Hayden: Ok, so maybe look into an HTML parser. You *could* do this with regex, but it will be painful. – FrustratedWithFormsDesigner Apr 15 '11 at 14:08
Alright, I haven't managed to get any of the answers working right, so I'll take a look. Thanks to everyone who helped. =) It seems it's going to be hard to find anything about what an HTML parser is. – Skeela87 Apr 15 '11 at 14:09
@Hayden: What language/platform are you using? Here's a parser for Java: http://htmlparser.sourceforge.net/ – FrustratedWithFormsDesigner Apr 15 '11 at 14:16
@Hayden: Ah, you're using .NET. Have you tried this one? http://htmlagilitypack.codeplex.com/ – FrustratedWithFormsDesigner Apr 15 '11 at 14:18
Sorry, but I have no idea what to do with that. Threads I find on it are in C# or are also having problems with it. – Skeela87 Apr 15 '11 at 14:33
@Hayden: If it's .NET you should be able to use it even from within a VB project. The syntax will look different but you're using a DLL, so it shouldn't be a big problem. I haven't tried using this library from VB.NET myself though, so I can't speak from my own experience, just my expectations... – FrustratedWithFormsDesigner Apr 15 '11 at 14:38
Do not use the greedy dot-star/dot-plus! This will erroneously match much more than you bargained for! – ridgerunner Apr 15 '11 at 15:02
It's OK, I don't understand any of this stuff, so I came up with a workaround. I get the string as I have been getting it, and then remove the tags with the .Replace function (replacing them with ""). – Skeela87 Apr 15 '11 at 15:08

score 0 · Answer 2 · answered Apr 15 '11 at 13:53

You can use a capturing group to get just the value instead of the tag plus the value:

"<span>(.*)</span>"

Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
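To show what the capturing group returns, and why the comments warn about the greedy `.*`, here is a small sketch in Python (used only for illustration; the question targets VB.NET, though the lazy quantifier and character class used here also exist in .NET's regex syntax). The sample `line` is made up; the character class `[\d.,\s]+` is the one suggested in the comments on the question.

```python
import re

# Hypothetical input with two values on one line.
line = "<span>41,396</span> <span>2,150</span>"

# Greedy: .* runs on to the LAST </span> on the line.
greedy = re.search(r"<span>(.*)</span>", line).group(1)
print(greedy)    # 41,396</span> <span>2,150

# Lazy: .*? stops at the first </span> it can reach.
lazy = re.findall(r"<span>(.*?)</span>", line)
print(lazy)      # ['41,396', '2,150']

# Explicit character class: only digits, dots, commas and whitespace.
specific = re.findall(r"<span>([\d.,\s]+)</span>", line)
print(specific)  # ['41,396', '2,150']
```

With a single value per line the greedy pattern happens to work, which is why it can look correct on the first sample and then fail on real pages.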

ntdt

score 0 · Answer 3 · answered Apr 15 '11 at 13:55

As far as I know, the regex will effectively match line by line (by default `.` does not match a line break), but you could write an expression that works around that.

Try: <span>(.*)</span>

You should be able to retrieve the information you want from capture group 1 (\1).

In the case of <span class="xpColumn">, the pattern would simply not match and \1 would be empty.

Cheers :)
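The line-by-line behaviour described in this answer can be checked with a short sketch. It is written in Python purely for illustration (the question is about VB.NET), and it runs the pattern over the multi-line snippet from the question. The `re.DOTALL` variant is an addition for comparison, not something the answer proposes.

```python
import re

# The multi-line fragment from the question, copied as-is.
html = """<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""

# By default "." does not match a newline, so each match stays on one line.
# Lines such as <span class="xpColumn"> lack the literal "<span>" and are skipped.
values = re.findall(r"<span>(.*)</span>", html)
print(values)  # ['41,396', '2,150', '161,305,807']

# If "." is allowed to match newlines, the greedy .* swallows everything
# between the first <span> and the last </span> as one single match.
dotall = re.findall(r"<span>(.*)</span>", html, flags=re.DOTALL)
print(len(dotall))  # 1
```

So the plain pattern works on this fragment only because every value sits alone on its own line.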
