13

I want to parse some HTML to extract the values of certain attributes and tags.

What HTML parsers do you recommend? Any pros and cons?

Charles Stewart
pek

3 Answers

12

NekoHTML, TagSoup, and JTidy will allow you to parse HTML and then process with XML tools, like XPath.
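
A minimal sketch of the XPath step, using only the JDK's built-in XML APIs. It assumes one of the parsers above has already turned the HTML into well-formed markup; the class name `XPathOverCleanedHtml`, the method `linkTargets`, and the sample input are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOverCleanedHtml {

    // Returns the href value of every <a> element in already well-formed markup.
    // Assumption: the input was cleaned up beforehand (for example by a tidying parser).
    static List<String> linkTargets(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a/@href", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Wrap parser/XPath failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String cleaned = "<html><body><a href=\"/one\">1</a>"
                + "<p><a href=\"/two\">2</a></p></body></html>";
        System.out.println(linkTargets(cleaned));
    }
}
```

Depending on the parser, the resulting elements may be placed in the XHTML namespace, in which case the XPath expression has to be namespace-aware.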

jelovirt
7

I have tried HTML Parser, which is dead simple to use.

pek
  • I have used HTML parser on a project and it worked exactly as expected – Craig Angus Sep 27 '08 at 00:21
  • 1
    but there are not many tutorials available... – Lily Jul 07 '09 at 14:25
  • I've noticed a lot of javascript snippets (and element attributes) creeping into my supposedly "text node" extractions. There have also been some cases where malformed HTML caused the whole parsing operation to fail. So I'm looking to replace the htmlparser library in my own project with something a little better. – benjismith Mar 16 '11 at 18:02
1

Do you need to do a full parse of the HTML? If you're just looking for specific values within the contents (a specific tag/param), then a simple regular expression might be enough, and could very well be faster.
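
As a rough sketch of the regex route, the following pulls the `src` value out of `<img>` tags with `java.util.regex`. The class name `AttributeRegex`, the method `imageSources`, and the pattern itself are hypothetical choices for illustration. The pattern only handles quoted attribute values, and it will happily match tags that sit inside comments or scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRegex {

    // Matches src="..." or src='...' inside an <img ...> tag.
    // Group 1 is the opening quote, group 2 the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects the src value of every <img> tag found in the given HTML text.
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"x\" src=\"a.png\"><IMG SRC='b.jpg'></p>";
        System.out.println(imageSources(html));
    }
}
```

This works when the markup is predictable and only one or two values are needed; for anything messier, a real parser is the safer choice.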

Herms