Regex HTML help

Question

Hey all, I need some help figuring out a regex for extracting the values inside the tags of HTML markup like this:

<span class="releaseYear">1993</span>
<span class="mpaa">R</span>
<span class="average-rating">2.8</span>
<span class="rt-fresh-small rt-fresh" title="Rotten Tomatoes score">94%</span>

I only need 1993, R, 2.8 and 94% from that HTML above.

Any help would be great as I don't have much knowledge when it comes to forming one of these things.

I'd suggest not using regex for a task like this. Read [this question](http://stackoverflow.com/questions/516811/how-do-you-parse-an-html-in-vb-net) on HTML parsing in .NET. — darioo, Apr 04 '11 at 12:23
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Matt Ball, Apr 04 '11 at 12:23
@Kobi it's just the archetypal "Don't use regex to parse (X)HTML" question on SO. — Matt Ball, Apr 04 '11 at 12:30

score 3 · Answer 1 · answered Apr 04 '11 at 12:24

Don't use a regular expression to parse HTML. Use an HTML parser. There is a good one here.

tster

score 3 · Answer 2 · Kobi · 2011-04-04T12:37:46.900

If you already have the HTML in a string:

string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

Or you can load a page directly from the web, which saves you about five lines of stream and request code:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");

Using the HTML Agility Pack, parse the HTML string and select all the span elements:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");

Now you can iterate over them, or simply get the text of each node:

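// Select and ToList are LINQ extension methods: add "using System.Linq;".
// Note: SelectNodes returns null when nothing matches, so check spans first.
IEnumerable<string> texts = spans.Select(span => span.InnerText).ToList();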

Alternatively, you can search for the node you're after:

HtmlNode nodeReleaseYear = doc.DocumentNode
                              .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;
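
For reference, the snippets above can be pulled together into one small program. This is only a rough sketch: the class name `SpanExtractor` and the helpers `GetSpanTexts` and `GetTextByClass` are made up for illustration, and it assumes the HtmlAgilityPack package is referenced by the project. The null checks are there because `SelectNodes` and `SelectSingleNode` return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class SpanExtractor
{
    // Hypothetical helper: inner text of every <span>, in document order.
    public static List<string> GetSpanTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
        if (spans == null)
        {
            return texts;
        }

        foreach (HtmlNode span in spans)
        {
            texts.Add(span.InnerText.Trim());
        }
        return texts;
    }

    // Hypothetical helper: text of the first <span> whose class attribute is
    // exactly className, or null if there is none.
    // Sketch only: className must not contain a single quote.
    public static string GetTextByClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode
                           .SelectSingleNode("//span[@class='" + className + "']");
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";
        // Print the text of every span, one per line.
        foreach (string text in GetSpanTexts(html))
        {
            Console.WriteLine(text);
        }

        // Look up a single value by its class attribute.
        Console.WriteLine(GetTextByClass(html, "mpaa"));
    }
}
```

For the sample markup this should give 1993, R, 2.8 and 94%. Keep in mind that `@class='...'` compares the whole attribute value, so the last span would need `'rt-fresh-small rt-fresh'` in full, or an XPath `contains(@class, 'rt-fresh')` test instead.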

edited Apr 04 '11 at 12:37

answered Apr 04 '11 at 12:31

Kobi

The code is C#, but it should be easy enough to convert to VB.Net. – Kobi Apr 04 '11 at 12:34
How do I get it working? I put **HtmlDocument doc = new HtmlDocument()** but it's underlined, saying **'HtmlDocument' is a type and cannot be used as an expression** and **Name 'doc' is not declared.** – StealthRT Apr 04 '11 at 12:42
@StealthRT, have you added a reference to HtmlAgilityPack in your project? – tster Apr 04 '11 at 12:52
@tster. Yes, I have **Imports HtmlAgilityPack** – StealthRT Apr 04 '11 at 12:56
@StealthRT - I'm not sure what the VB code should look like. I'd *guess* `Dim doc As New HtmlDocument`. Maybe this will help: http://stackoverflow.com/questions/516811/how-do-you-parse-an-html-in-vb-net/2604974#2604974 – Kobi Apr 04 '11 at 13:22
