
Hey all, I need some help working out a regex that extracts the values inside the tags of HTML markup like this:

<span class="releaseYear">1993</span>
<span class="mpaa">R</span>
<span class="average-rating">2.8</span>
<span class="rt-fresh-small rt-fresh" title="Rotten Tomatoes score">94%</span>

I only need 1993, R, 2.8 and 94% from that HTML above.

Any help would be great as I don't have much knowledge when it comes to forming one of these things.

jaraics
StealthRT
  • I'd suggest not using regex for a task like this. Read [this question](http://stackoverflow.com/questions/516811/how-do-you-parse-an-html-in-vb-net) on HTML parsing in .NET. – darioo Apr 04 '11 at 12:23
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Matt Ball Apr 04 '11 at 12:23
  • @Matt Ball - How is it a duplicate? – Kobi Apr 04 '11 at 12:24
  • @Kobi it's just the archetypal "Don't use regex to parse (X)HTML" question on SO. – Matt Ball Apr 04 '11 at 12:30

2 Answers


Don't use a regular expression to parse HTML. Use an HTML parser. There is a good one here.

tster

If you already have the HTML in a string:

string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

Or you can load a page from the internet directly with the HTML Agility Pack's HtmlWeb class (saves you from 5 lines of streams and requests):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");

If you started from a string instead, parse it with the HTML Agility Pack and select the spans:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");

Now you can iterate over them, or simply get the text of each node:

// Needs "using System.Linq;" and "using System.Collections.Generic;" at the top of the file.
List<string> texts = spans.Select(span => span.InnerText).ToList();
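
One caveat with the snippet above: as far as I know, `SelectNodes` returns `null` rather than an empty collection when nothing matches, so calling `Select` on the result can throw on a page without any spans. A minimal sketch of a guarded version follows; the `SpanTexts` class and `FromHtml` method are hypothetical names made up for illustration, not part of the HTML Agility Pack:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SpanTexts
{
    // Hypothetical helper: returns the trimmed text of every <span> in the
    // given HTML, or an empty list when the markup contains no spans.
    public static List<string> FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
        if (spans == null)
        {
            // No match: hand back an empty list instead of failing later.
            return new List<string>();
        }

        return spans.Select(span => span.InnerText.Trim()).ToList();
    }
}
```

Calling `SpanTexts.FromHtml(html)` with the four spans from the question should give you the four values in document order.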

Alternatively, you can search for the node you're after:

HtmlNode nodeReleaseYear = doc.DocumentNode
                              .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;
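
Note that `@class='releaseYear'` only matches when the attribute is exactly that string, so it would not find the last span from the question, whose class is `rt-fresh-small rt-fresh`. Here is a rough sketch of a lookup that matches a single class name among several, using the usual XPath 1.0 `concat`/`normalize-space` idiom. `SpanLookup` and `TextByClass` are hypothetical names, and the sketch assumes the class name contains no quote characters:

```csharp
using HtmlAgilityPack;

static class SpanLookup
{
    // Hypothetical helper: text of the first <span> whose class attribute
    // contains className as a whole token, or null when there is none.
    // Assumption: className contains no quote characters.
    public static string TextByClass(HtmlDocument doc, string className)
    {
        // Pad the attribute with spaces so "rt-fresh" matches as a token
        // of "rt-fresh-small rt-fresh" but a partial name like "rt" does not.
        string xpath =
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' "
            + className + " ')]";

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }
}
```

With the `doc` loaded above, `SpanLookup.TextByClass(doc, "rt-fresh")` should return the Rotten Tomatoes score, and a class that is not on the page gives `null` instead of an exception.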
Kobi
  • The code is C#, but it should be easy enough to convert to VB.Net. – Kobi Apr 04 '11 at 12:34
  • How do I get it working? I put **HtmlDocument doc = new HtmlDocument()** but it's underlined, saying **'HtmlDocument' is a type and cannot be used as an expression** and **Name 'doc' is not declared.** – StealthRT Apr 04 '11 at 12:42
  • @StealthRT, have you added a reference to HtmlAgilityPack in your project? – tster Apr 04 '11 at 12:52
  • @tster. Yes, I have **Imports HtmlAgilityPack** – StealthRT Apr 04 '11 at 12:56
  • @StealthRT - I'm not sure what the VB code should look like. I'd *guess* `Dim doc As New HtmlDocument`. Maybe this will help: http://stackoverflow.com/questions/516811/how-do-you-parse-an-html-in-vb-net/2604974#2604974 – Kobi Apr 04 '11 at 13:22