Is there a Standard Java SE HTML Parser? If so, why use non-standard ones?

Question

I need to parse a simple HTML page with a simple form in it. The answers to similar questions on StackOverflow suggest using one of a large variety of non-standard Java libraries such as TagSoup, JSoup, HTMLParser and many others.

However, a web search revealed that there exists some standard functionality in Java SE via this class: http://docs.oracle.com/javase/7/docs/api/javax/swing/text/html/parser/ParserDelegator.html

My sub-questions are:

Is it really true that the standard ParserDelegator class can parse a use case like mine?
What are the limitations of the standard library that create the need for so many non-standard libraries?
Does the fact that ParserDelegator lives in Swing preclude using it on a regular EC2 cloud server for a web application? Would I have to jump through a lot of hoops to get around the headless aspect, or would it be just a small tweak to the configuration?
If the standard one is not recommended, which non-standard one should I use, given: (a) my desire not to stray far from the standard; (b) my simple use case; (c) my desire for a mature, reliable implementation; and (d) no size or weight limitations, since this is a server application as opposed to an embedded client. API is a far lower priority, so while I do appreciate JSoup's CSS-selector-like API, concerns (a) through (d) override it.

Thank you.

close voter(s), please point to what this is a duplicate of (if that's your reason for a close vote) — necromancer, Jan 31 '12 at 07:23

score 4 · Accepted Answer · edited May 23 '17 at 10:27

The JDK has a built-in HTML parser, but it only supports a very old version of HTML (HTML 3.2 according to the Swing documentation, not modern HTML). It should support parsing of basic text formatting tags and forms.
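As a rough illustration of the simple-form case, here is a minimal sketch that uses ParserDelegator with an HTMLEditorKit.ParserCallback to collect the name attributes of input tags. The class name FormFieldLister and the method inputNames are hypothetical names made up for this example, and the sketch assumes reasonably clean markup; it is not a claim about how well the parser copes with messy real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; not part of the JDK.
class FormFieldLister {

    // Returns the value of the "name" attribute of every <input> tag found.
    static List<String> inputNames(String html) throws IOException {
        final List<String> names = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Tags without an end tag, such as <input>, are normally reported here.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs);
            }

            // Also handled defensively, in case a tag is reported as a start tag.
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs);
            }

            private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) {
                        names.add(name.toString());
                    }
                }
            }
        };

        // The third argument (ignoreCharSet = true) tells the parser not to
        // fail when the document declares a different character set.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return names;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><form action=\"/login\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\">"
                + "</form></body></html>";
        System.out.println(inputNames(html));
    }
}
```

The callback style means there is no document tree or selector API: you only get a stream of tag and text events, which is usually enough for a small form but gets awkward quickly for anything more involved.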

The reason to use third-party parsers is the need to handle "real" HTML pages: DHTML, JavaScript, etc.

JSoup is one of the popular parsers that can do the job. For more information about other implementations, please take a look at the following discussion:

Pure Java HTML viewer/renderer for use in a Scrollable pane

answered Jan 31 '12 at 07:24

AlexR

Thanks - I would rephrase it as: the built-in parser cannot handle anything but ancient versions of HTML. The link is not helpful. It is about viewers / renderers, whereas I need a parser. – necromancer Jan 31 '12 at 10:08
