
Possible Duplicate:
What are the pros and cons of the leading Java HTML parsers?

Which HTML parser would you recommend? I need one feature in particular: the parser should return only the useful text of a page, with no menu, footer, or header information. I want just the text of the main content.

I have tried Jericho HTML Parser and HtmlCleaner, but neither seems to do what I need.

Thanks in advance.

Paulius

1 Answer


I'm not really sure what you're asking; an HTML parser parses HTML--what you extract out of it is up to you. I like jsoup and tagsoup.

If you want something that pulls "normal" content out of HTML, you could look at how Apache Tika handles HTML. All HTML is written differently--you have to be able to define what "normal" content is, and where it is.
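
To make the "define what normal content is" point concrete, here is a minimal sketch in plain Java. It is not how jsoup, TagSoup, or Tika work internally. It assumes, purely for illustration, that "normal" content means a block with at least 10 words, of which no more than 30% sit inside links. The class `ContentBlockFilter`, its methods, and both thresholds are hypothetical. It splits on tags with regular expressions only to stay dependency-free, which is fragile on real pages, and it does not decode entities such as `&amp;`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from jsoup, TagSoup, or Tika.
class ContentBlockFilter {

    // Assumed thresholds; tune them for the pages you actually process.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.3;

    // Tags treated as block boundaries (an assumed, incomplete list).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br|nav|header|footer|section|article)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern SCRIPT_STYLE = Pattern.compile("(?is)<(script|style)\\b.*?</\\1>");

    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Returns the text of every block that passes both thresholds.
    static List<String> extractContent(String html) {
        // Drop script and style bodies so their source is not counted as text.
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            // Count the words that appear inside <a>...</a> in this block.
            int linkWords = 0;
            Matcher anchors = ANCHOR.matcher(block);
            while (anchors.find()) {
                linkWords += countWords(TAG.matcher(anchors.group(1)).replaceAll(" "));
            }
            // Strip the remaining tags and normalize whitespace.
            String text = TAG.matcher(block).replaceAll(" ").replaceAll("\\s+", " ").trim();
            int words = countWords(text);
            if (words >= MIN_WORDS && (double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<nav><ul><li><a href=\"/\">Home</a></li><li><a href=\"/about\">About us</a></li></ul></nav>"
                + "<div><p>This paragraph is the main article text and it has clearly more than ten words in it.</p></div>"
                + "<footer><p>Copyright 2011 Example Corp. "
                + "<a href=\"/terms\">Terms of use and privacy policy for this web site</a></p></footer>"
                + "</body></html>";
        // The menu items are too short and the footer is mostly link text,
        // so only the article paragraph passes both thresholds.
        for (String block : extractContent(html)) {
            System.out.println(block);
        }
    }
}
```

Any real page will break these assumptions somewhere (short article paragraphs, long link-heavy captions), which is why you have to decide up front what counts as content for your particular sources.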

Dave Newton
  • I have found an incredible parser, exactly what I was looking for. You can check it out yourself; it's open source: http://boilerpipe-web.appspot.com/ – Paulius Oct 27 '11 at 17:34
  • @Paulius That looks pretty cool; similar to what Tika does. Thanks for the reference. – Dave Newton Oct 27 '11 at 17:36