I have the following code, grabbed from a web page's source:

<span>41,396</span>

And the following regex:

("<span>.*</span>")

Which returns

<span>New Users</span>

However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.

More importantly, I need a regex for the following code:

<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>

I was thinking this may involve line breaks and more, which is why I threw this in separately.

Skeela87
  • Which language are you trying to implement this in? – Chandu Apr 15 '11 at 13:51
  • What do you want the regex to do with the code snippet? – Fermin Apr 15 '11 at 13:52
  • It's not a good idea to use regex to parse (X)HTML: see [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for a discussion of why. You'd be better off using an (X)HTML parser and processing your (X)HTML programmatically. Which programming language are you using? – MarcoS Apr 15 '11 at 13:53
  • This is in VB.net using an HttpWebRequest. It returns the result into a list box. The program is fine apart from knowing how to regex it. I also have no idea what a parser is or means. – Skeela87 Apr 15 '11 at 14:01
  • Do not use the greedy dot-star! It will erroneously match _much_ more than you bargained for. (To match a simple SPAN containing only numbers and whitespace, use something like this instead: `"[\d.,\s]+`) _Say what you mean, mean what you say!_ – ridgerunner Apr 15 '11 at 15:08
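
To illustrate the greedy dot-star warning above, here is a minimal sketch in Python (chosen only because the thread has no host-language code; the regex syntax used is the same in .NET). The pattern quoted in the comment looks truncated, so wrapping the character class in `<span>` tags is an assumption, as is the one-line sample input.

```python
import re

# Assumed sample: two spans on one line, as in minified page source.
html = "<span>41,396</span><span>2,150</span>"

# Greedy: .* runs on to the LAST </span>, swallowing the tags in between.
greedy = re.findall(r"<span>(.*)</span>", html)

# Restrictive: only digits, dots, commas and whitespace may appear inside.
strict = re.findall(r"<span>([\d.,\s]+)</span>", html)

print(greedy)  # ['41,396</span><span>2,150']
print(strict)  # ['41,396', '2,150']
```

The character class cannot cross a `<`, so each match stops at its own closing tag.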

3 Answers

You could try something like

<span( class=\".+\")?>(.*)</span>

And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?

If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
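
For a rough idea of what "using an HTML parser" looks like, here is a sketch with Python's standard-library `html.parser` standing in for a .NET library such as the HTML Agility Pack. The class name `SpanTextCollector` is made up for illustration, and the input is the fragment from the question.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the non-blank text found inside any <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.values = []  # text fragments found inside spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags, like the one in the fragment.
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.values.append(text)


html = """<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""

collector = SpanTextCollector()
collector.feed(html)
collector.close()
print(collector.values)  # ['41,396', '2,150', '161,305,807']
```

Nested spans and line breaks need no special handling here, which is the main argument for a parser over a regex.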

FrustratedWithFormsDesigner
  • The reason for just using as the regex was just for my own experiment. But I thought if you used the whole 2nd part of the code, it would work fine for grabbing data from a table? – Skeela87 Apr 15 '11 at 14:07
  • @Hayden: Ok, so maybe look into an HTML parser. You *could* do this with regex, but it will be painful. – FrustratedWithFormsDesigner Apr 15 '11 at 14:08
  • Alright, I haven't managed to get any of the answers working right. So I'll take a look. Thanks to everyone who helped. =). Seems it's going to be hard to find a single thing about what an HTML parser is. – Skeela87 Apr 15 '11 at 14:09
  • @Hayden: What language/platform are you using? Here's a parser for Java: http://htmlparser.sourceforge.net/ – FrustratedWithFormsDesigner Apr 15 '11 at 14:16
  • @Hayden: Ah, you're using .NET. Have you tried this one? http://htmlagilitypack.codeplex.com/ – FrustratedWithFormsDesigner Apr 15 '11 at 14:18
  • Sorry, but I have no idea what to do with that. Threads I find on it are in C# or are also having problems with it. – Skeela87 Apr 15 '11 at 14:33
  • @Hayden: If it's .NET you should be able to use it even from within a VB project. The syntax will look different but you're using a DLL, so it shouldn't be a big problem. I haven't tried using this library from VB.NET myself though, so I can't speak from my own experience, just my expectations... – FrustratedWithFormsDesigner Apr 15 '11 at 14:38
  • Do not use the greedy dot-star/dot-plus! This will erroneously match much more than you bargained for! – ridgerunner Apr 15 '11 at 15:02
  • It's OK, I don't understand any of this stuff. So I came up with a workaround. I just get the string as I have been getting it, and then remove the tags with the .replace function (replacing them with "") – Skeela87 Apr 15 '11 at 15:08

You can use a capturing group to get just the value instead of tag + value:

"<span>(.*)</span>"

Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
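
As a small sketch of how the capturing group is read back, here it is in Python for illustration. In .NET the equivalent lookup would be along the lines of `Match.Groups(1).Value`.

```python
import re

html = "<span>41,396</span>"

match = re.search(r"<span>(.*)</span>", html)
if match:
    print(match.group(0))  # whole match: <span>41,396</span>
    print(match.group(1))  # first capturing group: 41,396
```

Group 0 is always the full match; numbered groups start at 1 and follow the order of the opening parentheses.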

ntdt

As far as I know, `.` does not match line breaks by default, so this pattern only matches within a single line, but you could write an expression that handles that.

Try: <span>(.*)</span>

You should be able to retrieve the information you want with \1

In the case of <span class="xpColumn"> it would simply not match, so there would be nothing in \1.
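
A quick way to check this behaviour on the multi-line fragment from the question is sketched below in Python. Without a "dot matches newline" option, `.` stops at line breaks, and the tags carrying a `class` attribute never match the literal `<span>`, so they produce no match at all rather than an empty group.

```python
import re

html = """<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""

pattern = re.compile(r"<span>(.*)</span>")

# Each plain <span>...</span> pair sits on one line, so it matches on its own.
values = [m.group(1) for m in pattern.finditer(html)]
print(values)  # ['2,150', '161,305,807']

# The class-bearing opening tag yields no match object at all.
print(pattern.search('<span class="xpColumn">'))  # None
```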

Cheers :)

filippo