
I've got a requirement to grab text out of some pretty messy HTML. Let's say I need the third list item from the first list on the page. The `li` elements may or may not have closing tags, the tag names may be in mixed case, and they may carry classes or other attributes.

I was wondering whether, in a console application, it is possible to use a class (DOMDocument???) to load the HTML into a DOM, which would at least sanitize it somewhat, and then pull the text out from there.

This seems like something that should be solved already, but I've not found anything too relevant except this vintage regex solution http://www.vsj.co.uk/articles/display.asp?id=389

Any thoughts on whether this is a good approach, and on which classes to investigate, would be appreciated.

Andiih
  • Check out http://stackoverflow.com/questions/653357/html-parsing-libraries-for-net - the answer there, i.e. to use `HTMLAgilityPack`, is the most common and easiest approach that I know of. – Jagmag Jan 22 '11 at 13:49

1 Answer


The Html Agility Pack can be used to work with 'messy' HTML in a DOM fashion.
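
To make the idea concrete, here is a minimal, language-neutral sketch of the "parse tolerantly, then pick the node" approach. It is not Html Agility Pack code: it uses Python's standard-library `html.parser` as a stand-in, and the names `FirstListItems`, `nth_item_of_first_list` and the sample markup are hypothetical, made up for illustration. It assumes that a new `<li>` implicitly closes the previous one, and that "first list" means the first `<ul>` or `<ol>` in document order.

```python
from html.parser import HTMLParser


class FirstListItems(HTMLParser):
    """Collect the text of each <li> that sits directly in the first <ul>/<ol>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []       # text of each item of the first list
        self.depth = 0        # 0 = outside the first list, 1 = in it, >1 = nested list
        self.done = False     # True once the first list has been closed
        self.in_item = False  # True while inside an <li> of the first list

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <LI>, <Li> and <li> all arrive as "li".
        if self.done:
            return
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li" and self.depth == 1:
            # Assumption: a new <li> implicitly ends the previous, unclosed one.
            self.items.append("")
            self.in_item = True

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag in ("ul", "ol"):
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                self.in_item = False
        elif tag == "li" and self.depth == 1:
            self.in_item = False

    def handle_data(self, data):
        # Text inside nested lists is kept as part of the enclosing item.
        if self.in_item and not self.done:
            self.items[-1] += data


def nth_item_of_first_list(html, n):
    """Return the whitespace-normalized text of the n-th (1-based) item, or None."""
    parser = FirstListItems()
    parser.feed(html)
    parser.close()
    if n < 1 or n > len(parser.items):
        return None
    return " ".join(parser.items[n - 1].split())


# Hypothetical messy input: mixed-case tags, a class attribute, missing </li> tags.
messy = """<html><body><p>intro
<UL class="menu">
  <LI>first
  <li class="x">second</li>
  <Li>third &amp; final
  <li>fourth
</UL>
<ul><li>other list</ul>
"""

print(nth_item_of_first_list(messy, 3))  # prints: third & final
```

A DOM-style library does the same job with far less code: load the document, then run a query for the third `li` of the first list. The hand-rolled state machine above only shows why a tolerant parser copes with input that a regex would trip over.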

Tim Lloyd
  • Do not even consider [using Regex to parse Html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)! :) – Tim Lloyd Jan 22 '11 at 13:55
  • I wasn't going to. HTML is made almost entirely out of edge cases! – Andiih Jan 22 '11 at 14:20