0

I have been using Cobra until now because it was so easy, but unfortunately it fails on a few of my test cases. Can anyone suggest a tried-and-tested library?

I've tried Cobra's built-in parser and HTMLCleaner without any luck.

Legend
  • Judging by your last question, the problem isn't with the "XPath evaluator". You were using `XPathFactory.newInstance()`, which creates the stock Java evaluator that works on any XML document loaded into a DOM model (as an instance of `Document`). Cobra itself isn't an XPath evaluator; it's an HTML parser that produces a `Document`, and it did that wrong in your case. So what you actually want is a "good Java HTML parser", not a "good Java XPath evaluator". – Pavel Minaev Nov 26 '09 at 23:55
  • Oops... sorry. I've revised my question... I'm just going nuts with all the HTML in front of my eyes... – Legend Nov 27 '09 at 00:05
  • I'm sure this same question was on SO earlier this week... – DisgruntledGoat Nov 27 '09 at 00:36
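To illustrate the split described in the first comment (a parser builds the `Document`, and the stock `XPathFactory.newInstance()` evaluator then queries it), here is a minimal sketch. It assumes the markup is already well-formed, so the JDK's own `DocumentBuilder` stands in for the HTML parser step; real-world HTML would need one of the parsers suggested in the answers instead. The class name `XPathOverDom`, its method names, and the sample markup are hypothetical and only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathOverDom {

    // Stand-in for the HTML parser step (assumption: input is well-formed).
    // A tolerant HTML parser would replace this method for messy pages.
    public static Document parseWellFormed(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // The stock JDK XPath evaluator works on any DOM Document,
    // regardless of which parser produced it.
    public static List<String> linkTargets(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(nodes.item(i).getNodeValue());
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body>"
                + "<a href=\"first.html\">one</a>"
                + "<a href=\"second.html\">two</a>"
                + "</body></html>";
        System.out.println(linkTargets(parseWellFormed(markup)));
    }
}
```

If the XPath results look wrong with a setup like this, the `Document` handed to the evaluator is the first thing to inspect, since the evaluator only sees what the parser built.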

5 Answers

4

TagSoup is really great at dealing with crappy HTML/XHTML.

Jericho (and NekoHTML) are also good at parsing invalid HTML.

TagSoup and Jericho: tried-and-tested by me. NekoHTML: recommended by a source I trust.

Pascal Thivent
1

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).

Jim Garrison
1

Mozilla HTML Parser looks rather interesting. By design, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.

Pavel Minaev
1

[Answering the title - the overall question and comments are not consistent]

JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTML Tidy. It's very useful, though I think development may have slowed or ceased.

peter.murray.rust
1

I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Ms2ger