Is there a class I can use to extract elements from messy HTML?

Question

I've got a requirement to grab text out of some pretty messy HTML. Let's say I need the third list item from the first list on the page. The `li` elements may or may not have closing tags, the tags may be in mixed case, and they may carry classes and other attributes.

I was wondering whether, in a console application, it is possible to use a class (DOMDocument???) to load the HTML into a DOM, which would at least sanitize it somewhat, and then parse it out of there.

This seems like something that should be solved already, but I haven't found anything very relevant except this vintage regex solution: http://www.vsj.co.uk/articles/display.asp?id=389

Any thoughts on whether this is a good approach, and on the correct classes to investigate, would be appreciated.

Check out http://stackoverflow.com/questions/653357/html-parsing-libraries-for-net. The answer there, i.e. to use `HTMLAgilityPack`, is the most common and easiest approach that I know of. – Jagmag, Jan 22 '11 at 13:49

score 4 · Accepted Answer · answered Jan 22 '11 at 13:50

The Html Agility Pack can be used to work with 'messy' HTML in a DOM fashion: it parses malformed markup into a node tree that you can then query, for example with XPath.
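As a rough illustration of the question's scenario (the third item of the first list), here is a minimal sketch. It assumes the `HtmlAgilityPack` NuGet package is referenced and that "first list" means the first `ul`. The sample markup is made up for the example. How unclosed tags get repaired is worth checking against your real input, so the code handles the case where nothing matches.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base class library

class Program
{
    static void Main()
    {
        // Hypothetical sample input: mixed-case tags, class attributes, unclosed <li> elements.
        string html = @"<html><body>
<UL class=""menu"">
  <LI>First
  <li class=""odd"">Second
  <Li>Third
  <li>Fourth
</UL>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // (//ul)[1] is the first <ul> in document order; li[3] is its third <li> child.
        // Note that //ul[1] without the parentheses means something different:
        // every <ul> that is the first <ul> child of its parent.
        HtmlNode item = doc.DocumentNode.SelectSingleNode("(//ul)[1]/li[3]");

        if (item == null)
        {
            // SelectSingleNode returns null when the XPath matches nothing.
            Console.WriteLine("No third list item found.");
        }
        else
        {
            // Decode entities such as &amp; and trim the surrounding whitespace.
            Console.WriteLine(HtmlEntity.DeEntitize(item.InnerText).Trim());
        }
    }
}
```

The XPath uses lowercase element names because the library exposes node names in lowercase regardless of the case used in the source markup. To cover ordered lists as well, the expression would need adjusting.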

Tim Lloyd


Do not even consider [using regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)! :) – Tim Lloyd Jan 22 '11 at 13:55
I wasn't going to. HTML is made almost entirely out of edge cases! – Andiih Jan 22 '11 at 14:20
