Java library for cleaning up HTML just like a browser would

Question

So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.

Doesn't sound so tough, right? After all, every browser on the market deals with malformed HTML and turns it into something renderable on nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem...for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?

One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently don't capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.

I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job...at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.

Can anyone help me in my search for the perfect HTML parser? Thanks in advance!

Define badly formatted HTML for me... do you mean improper indentation or malformed HTML in general? The latter is much larger in scope. — maple_shaft, May 24 '11 at 15:48
We're definitely talking the latter. For this project, I have to pull in a web page, apply a number of transformations, and then display the result to the user. If the formatting/layout is significantly different than the pre-processed HTML, I'm in trouble. — stevevls, May 24 '11 at 15:52

score 7 · Accepted Answer · edited May 23 '17 at 12:24

JSoup I would say

See Also

which-html-parser-is-best
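
For readers who want to see what this looks like in practice, here is a minimal sketch, assuming the jsoup jar is on the classpath. The class name `HtmlCleanupSketch` and the helper `clean` are made up for illustration and are not part of jsoup; the sample input is just two unclosed paragraphs. jsoup parses the input into a DOM using HTML5-style parsing rules and then serializes the repaired tree back to HTML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlCleanupSketch {

    // Hypothetical helper (not a jsoup API): parse possibly malformed HTML
    // and return the serialized, repaired document.
    static String clean(String dirtyHtml) {
        Document doc = Jsoup.parse(dirtyHtml);
        // Disable pretty-printing so jsoup does not add its own indentation
        // or line breaks to the output.
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        // Two paragraphs with no closing tags and no html/head/body wrapper.
        String dirty = "<p>One<p>Two";
        System.out.println(clean(dirty));
    }
}
```

Whether the result renders closely enough to the original for a given page is something to check against real inputs; this only shows the parse-and-serialize round trip.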

answered May 24 '11 at 15:45

jmj

Thanks for the link to the other question. I've seen that one before, and chased down some of the links, though I left with the conclusion that my problem here is slightly different. I'm evaluating JSoup now, and it's looking very promising. I had previously skipped over it b/c the name was so close to TagSoup that I thought they were the same. ;) – stevevls May 24 '11 at 16:50
After a day of using it, I can officially say that this library rocks. Thanks so much! – stevevls May 25 '11 at 16:00

score 1 · Answer 2 · answered May 24 '11 at 16:44

I have used HTML Tidy in the past.

Chris Nava

Huh...it looks like they released a version in 2009 for the first time in 8 years. I have used Tidy in the past too, but I was underwhelmed even then. Maybe about five years back, I started a project (this time where I was parsing HTML -> XML) and ended up using NekoHTML instead (which is also kind of dead now). – stevevls May 24 '11 at 18:26

score 0 · Answer 3 · answered May 24 '11 at 16:30

TagSoup?

user240515

Thanks, but unfortunately that library falls into the camp of "let's turn HTML into XML." I already evaluated it and tossed it out b/c it was giving me HTML that rendered differently than the source. – stevevls May 24 '11 at 18:20