I'm trying to parse some HTML and return an array with the values inside each element.

For example, if I pass the markup below into a function:

var element = "td";
var html = "<tr><td>1</td><td>2</td></tr>";
return Regex.Split(html, string.Format("<{0}*.>(.*?)</{0}>", element));

I expect to get back an array: { 1, 2 }

What does my regex need to look like? Currently my array comes back with far too many elements, and my regex skills are lacking.

Toran Billups
  • [Parsing (X)HTML with RegEx!?!!!!???](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) That joke never gets old, does it? – dtb Sep 27 '10 at 20:37
  • Before you continue down this path, read this (edit - dtb beat me to it) – Donut Sep 27 '10 at 20:39

3 Answers

Do not parse HTML using regular expressions.

Instead, you should use the HTML Agility Pack.

For example:

HtmlDocument doc = new HtmlDocument();
doc.Parse(str);

IEnumerable<string> cells = doc.DocumentNode.Descendants("td").Select(td => td.InnerText);
SLaks

You really should not use regex to parse HTML. HTML is not a regular language, so a regex isn't capable of interpreting it properly. You should use a parser.

C# has HTML parsers for this.

JoshD

The method for loading the HTML has changed since the original answer. It is now:

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

However, if you follow the documentation at the link above, you should be fine :)
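
Putting the pieces of this thread together, here is a minimal sketch of the function the question asks for. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HtmlValues` and the method name `GetInnerTexts` are hypothetical names chosen for illustration and do not come from the library or from the answers above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlValues
{
    // Hypothetical helper: returns the inner text of every element
    // with the given tag name found in the markup.
    public static string[] GetInnerTexts(string html, string elementName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // load from a string, as described above

        return doc.DocumentNode
                  .Descendants(elementName)       // all matching elements, at any depth
                  .Select(node => node.InnerText) // text content of each element
                  .ToArray();
    }

    public static void Main()
    {
        // The markup and element name from the question.
        var values = GetInnerTexts("<tr><td>1</td><td>2</td></tr>", "td");
        Console.WriteLine(string.Join(", ", values));
    }
}
```

For the markup in the question, this should give back the two cell values as strings. Note that `InnerText` also includes the text of any nested elements, so a cell containing further markup would return all of its text joined together.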

MikeDub