Specific Java HTML parser

Question

Possible Duplicate:
What are the pros and cons of the leading Java HTML parsers?

Which HTML parser would you recommend? I need one specific feature: the parser should return only the useful text, with no menus, footers, or header information. I want just the text that makes up the main content.

I have tried Jericho HTML Parser and HtmlCleaner, but neither seems to work the way I need.

Thanks in advance.

score 2 · Answer 1 · answered Oct 27 '11 at 16:49

I'm not really sure what you're asking; an HTML parser parses HTML--what you extract out of it is up to you. I like jsoup and tagsoup.

If you want something that pulls "normal" content out of HTML, you could look at how Apache Tika handles HTML. All HTML is written differently--you have to be able to define what "normal" content is, and where it is.
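To make "define what normal content is" concrete, here is a rough sketch of one possible definition: keep a block of text only if it has enough words and few of them sit inside links. This is just an illustration using the standard library and regular expressions; it is not how Tika or any other library does it, and regex-based tag handling is fragile on real-world HTML. The class name `ContentSketch`, the method `extractMainText`, and both thresholds are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
final class ContentSketch {
    // Assumed thresholds; tune them for the pages you actually process.
    private static final int MIN_WORDS = 10;
    private static final double MAX_LINK_DENSITY = 0.3;

    // Block-level tags treated as boundaries between text blocks.
    private static final Pattern BLOCK_BOUNDARY = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br|nav|header|footer|section|article)\\b[^>]*>");
    // Script and style elements, removed together with their contents.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
        "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Anchor elements; group 1 captures the link text.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Replaces every tag with a space and collapses runs of whitespace.
    static String stripTags(String fragment) {
        return ANY_TAG.matcher(fragment).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Returns the blocks that look like running text, one per line.
    public static String extractMainText(String html) {
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BOUNDARY.split(cleaned)) {
            String text = stripTags(block);
            int words = countWords(text);
            if (words < MIN_WORDS) {
                continue; // too short: likely a menu item, footer line, or label
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            // Keep the block only if most of its words are outside links.
            if ((double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }
}
```

You would call it as `ContentSketch.extractMainText(html)` on the raw page source. For anything beyond a quick experiment, do the block splitting on a real parse tree (from jsoup, for instance) instead of regexes, and expect to adjust the thresholds per site.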

Dave Newton

I have found an incredible parser, exactly what I was looking for. You can check it out yourself; it's open source: http://boilerpipe-web.appspot.com/ – Paulius Oct 27 '11 at 17:34
@Paulius That looks pretty cool; similar to what Tika does. Thanks for the reference. – Dave Newton Oct 27 '11 at 17:36
