How to parse HTML and return an array of values in C# using Regex.Split

Question

Currently I'm trying to parse some html and return an array with the values inside each element.

For example:

If I pass the markup below into a function:

var element = "td";
var html = "<tr><td>1</td><td>2</td></tr>";
return Regex.Split(html, string.Format("<{0}*.>(.*?)</{0}>", element));

I'm expecting to get back an array: { 1, 2 }

What does my regex need to look like? Currently my array is coming back with far too many elements, and my regex skills are lacking.

[Parsing (X)HTML with RegEx!?!!!!???](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) That joke never gets old, does it? — dtb, Sep 27 '10 at 20:37
Before you continue down this path, read this (edit - dtb beat me to it) — Donut, Sep 27 '10 at 20:39

score 6 · Accepted Answer · answered Sep 27 '10 at 20:37

Do not parse HTML using regular expressions.

Instead, you should use the HTML Agility Pack.

For example:

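// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Parse(str);

// Collect the text of every <td> element, in document order.
IEnumerable<string> cells = doc.DocumentNode.Descendants("td").Select(td => td.InnerText);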

answered Sep 27 '10 at 20:37

SLaks

score 1 · Answer 2 · answered Sep 27 '10 at 20:38

You really should not use regex to parse HTML. HTML is not a regular language, so a regex isn't capable of interpreting it properly. You should use a parser.

C# has HTML parsers for this.
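
To make the point about regular languages concrete, here is a small sketch that uses only System.Text.RegularExpressions. The pattern, the helper name CaptureCells, and the nested sample markup are illustrative assumptions rather than anything from the question. A lazy tag pattern looks fine on flat markup, but it stops at the first closing tag it meets, so a cell that contains a nested table is cut short.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class NestedCellDemo
{
    // Hypothetical helper: returns whatever a lazy tag pattern captures for each <td>.
    public static string[] CaptureCells(string html)
    {
        return Regex.Matches(html, "<td[^>]*>(.*?)</td>", RegexOptions.Singleline)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    static void Main()
    {
        // Flat markup happens to work.
        var flat = "<tr><td>1</td><td>2</td></tr>";
        Console.WriteLine(string.Join(" | ", CaptureCells(flat)));
        // prints: 1 | 2

        // A cell holding a nested table does not: the match ends at the inner </td>,
        // so the capture is a fragment and the trailing "outer" text is lost.
        var nested = "<td><table><tr><td>inner</td></tr></table>outer</td>";
        Console.WriteLine(string.Join(" | ", CaptureCells(nested)));
        // prints: <table><tr><td>inner
    }
}
```

No amount of pattern tuning fixes this in general, because matching an opening tag to its own closing tag requires tracking nesting depth, which is what a parser does.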

answered Sep 27 '10 at 20:38

JoshD

score 0 · Answer 3 · answered Jun 20 '19 at 23:02

The method for loading the HTML has changed since the original answer. It is now:

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

However, if you follow the documentation linked above, you should be fine :)
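
Putting the string-loading call together with the Descendants query from the accepted answer, a minimal sketch of the function the question asks for might look like this. It assumes the HtmlAgilityPack NuGet package is referenced; the class name CellExtractor and the method name GetValues are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class CellExtractor
{
    // Hypothetical helper: returns the inner text of every element with the given tag name.
    public static string[] GetValues(string html, string element)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // load markup from a string

        return doc.DocumentNode
                  .Descendants(element)           // all matching nodes, in document order
                  .Select(node => node.InnerText) // text content of each node
                  .ToArray();
    }

    static void Main()
    {
        // Same input as in the question.
        var values = GetValues("<tr><td>1</td><td>2</td></tr>", "td");
        Console.WriteLine(string.Join(", ", values));
    }
}
```

Returning a string[] mirrors the array the question expects; if lazy evaluation is preferred, dropping the ToArray() call and returning IEnumerable<string> works as well.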
