HTML/XML Parser for Java

Question

What HTML parsers have the following features:

Fast
Thread-safe
Reliable and bug-free
Parses HTML and XML
Handles erroneous HTML
Has a DOM implementation
Supports HTML4, JavaScript, and CSS tags
Relatively simple, object-oriented API

Which parser do you think is best?

Thank you.

What do you mean by "support HTML4, JavaScript and CSS"? A parser is just that, a parser; it won't interpret your page. If you want to simulate a browser, please rephrase your question. – Valentin Rocher, Jan 24 '10 at 23:37
No. Some parsers do not understand things like CSS. That is what I mean. – Shayan, Jan 24 '10 at 23:41

Cesar · Answer 1 · 2010-01-25T20:40:22.053

16

Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds like exactly what you want. You write XML script files that tell the scraper what information to extract and where to find it. The provided GUI is very useful for testing the scripts quickly.

Check out the project's samples page to see if it's a good fit for what you are trying to do.

edited Jan 25 '10 at 20:40

answered Jan 25 '10 at 00:16

Cesar


2

+1 for Web Harvest -- if you are trying to do page scraping it is the way to go. – jckdnk111 Jan 25 '10 at 02:34

Valentin Rocher · Answer 2 · 2010-01-24T23:39:32.357

7

The best known are NekoHTML and JTidy.

NekoHTML is based on Xerces and provides a simple, adaptable SAX parser that implements the Java SE XMLReader interface.
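To show what coding against XMLReader looks like, here is a minimal sketch that collects link targets through a SAX handler. It is an illustration under assumptions, not NekoHTML's own API: the class name `LinkCollector` and the `collect` helper are made up for this example, and the JDK's built-in reader stands in for an HTML-aware one, so the sample input has to be well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers the href of every <a> element.
public class LinkCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Record the href as the parser streams past each anchor element.
        if ("a".equalsIgnoreCase(qName)) {
            String href = attrs.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    public List<String> getHrefs() {
        return hrefs;
    }

    // Runs the given XMLReader over the markup and returns the collected hrefs.
    public static List<String> collect(XMLReader reader, String markup) throws Exception {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getHrefs();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK's default reader is used here as a stand-in.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<html><body><a href=\"/one\">1</a><p><a href=\"/two\">2</a></p></body></html>";
        System.out.println(collect(reader, markup));
    }
}
```

Because `collect` depends only on the XMLReader interface, any parser that implements that interface could in principle be passed in instead of the JDK one.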

JTidy is intended more for reformatting your HTML into something XML-valid, but it is still very useful as a parser, producing a DOM tree if needed.
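Once a parser hands back a DOM tree, extraction is plain W3C DOM and XPath work. The sketch below assumes the markup is already well-formed and uses the JDK's own `DocumentBuilder` to build the tree; with messy HTML, one of the tolerant parsers would have to supply the `Document` instead. The names `DomExtract`, `parse` and `texts` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: DOM-based extraction with XPath.
public class DomExtract {

    // Parses well-formed markup into a DOM tree (assumption: input is valid XML).
    public static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    // Returns the trimmed text content of every node matched by the XPath expression.
    public static List<String> texts(Document doc, String xpath) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>");
        System.out.println(texts(doc, "//li"));
    }
}
```

The extraction code never touches the parser itself, so which library produces the tree matters less than whether it can cope with the input.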

You could have a look at this list for other alternatives.

Another option is to use Hpricot through JRuby.

edited Jan 24 '10 at 23:39

answered Jan 24 '10 at 23:32

Valentin Rocher


Why? Which features do they have? – Shayan Jan 24 '10 at 23:33
SAX is not what I want, and the main purpose of JTidy is cleaning up markup. Are you sure that it does what I want better than the others? – Shayan Jan 24 '10 at 23:40
What do you want exactly, then? – Valentin Rocher Jan 24 '10 at 23:42
It should be DOM-based, and I want extraction to be its main job, not transformation. – Shayan Jan 24 '10 at 23:45
It doesn't matter what its "main" job is, as long as it does what you want it to do. – Anon. Jan 25 '10 at 00:08

score 6 · Answer 3 · answered Jan 25 '10 at 09:50


Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.


Ms2ger


score 5 · Accepted Answer · edited Aug 31 '12 at 23:02


Apache Tika is the best choice. Apache has recently spun a number of sub-projects out of existing projects and released them on their own; Tika is one of them, and was previously a component of Apache Lucene. Given Apache's backing and the widely used parent project Lucene, it should be a very good choice. It is also open source.

A brief introduction from Apache Tika web site:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And the supported formats are:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format


reevesy


answered Mar 30 '12 at 02:52

Shayan


1

Apache Tika is an excellent suggestion. Even if you are not interested in reading XML/HTML/MS DOC formats you can just specify "text/plain". It will stream in the data so it doesn't need to preload the whole file first. List of benefits: http://tika.apache.org/1.4/parser.html Article with sample code: http://www.openlogic.com/wazi/bid/314389/Content-mining-with-Apache-Tika – Salvador Valencia Dec 03 '13 at 21:48
1

I came for a solid HTML Parser, and left with one that I won't have to spend the time to generalize. I love this game. – Inversus Jan 18 '14 at 09:00

score 1 · Answer 5 · answered Jan 24 '10 at 23:35

Well:

There aren't as many good HTML parsers in Java as you might need, but here are some alternatives: http://java-source.net/open-source/html-parsers

Very few of them support JavaScript. Actually, I think you'll have to do this part on your own using Rhino (http://www.mozilla.org/rhino/).

score 1 · Answer 6 · answered Jan 24 '10 at 23:47


I think that HtmlCleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.


Pascal Thivent


But this is also for transforming HTML into well-formed XML. My main goal is extracting data from it. – Shayan Jan 24 '10 at 23:53
@Shayan So what? Doesn't it allow you to extract data from it? Doesn't it offer DOM manipulation? Doesn't it allow you to parse nasty HTML? I don't get you. – Pascal Thivent Jan 25 '10 at 01:00

score 1 · Answer 7 · answered Jan 24 '10 at 23:57


You probably want to look at something like running Mozilla in headless mode. Here is a link to get you started; I am sure you can use Google to find more information.

