Is it possible to parse an HTML document and build a DOM tree (Java)?

Question

Is it possible to parse an HTML document, either from a string or from a file, and construct a DOM tree that a developer can walk through some API? If so, what tools could be used?

For example:

DomRoot = parse("myhtml.html");

for (tags : DomRoot) {
}

Note: this is an HTML document, not XHTML.

please include "parsing" as a tag too – JuanZe Sep 16 '09 at 14:22

score 4 · Answer 1 · answered Sep 16 '09 at 14:49

You can use TagSoup - it is a SAX Compliant parser that can clean malformed content such as HTML from generic web pages into well-formed XML.

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
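Since the question asks for a DOM tree and TagSoup is a SAX parser, one way to bridge the two is the JDK's identity transformer, which can replay the events of any SAX XMLReader into a DOM. The following is only a sketch: the class name SaxToDom and the method toDom are made up for illustration, and the JDK's own strict XML parser stands in for TagSoup so that it runs without extra jars. The assumption is that, with TagSoup on the classpath, you would pass new org.ccil.cowan.tagsoup.Parser() as the reader and could then feed it real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class SaxToDom {
    /** Runs the given SAX reader over the markup and collects its events into a DOM Document. */
    static Document toDom(XMLReader reader, String markup) throws Exception {
        // A Transformer created without a stylesheet copies its input to its output unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, new InputSource(new StringReader(markup))), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's strict XML parser, so this input must already be well-formed.
        // Assumption: with TagSoup available, new org.ccil.cowan.tagsoup.Parser() would go here.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Document doc = toDom(reader,
                "<p>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text</p>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Keeping the reader as a parameter means the same bridge should work for any SAX-compliant parser, not only the one used here.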

Thiyagaraj

TagSoup is very good, especially if you have to parse crappy HTML – Pascal Thivent Sep 16 '09 at 14:59

score 2 · Answer 2 · answered Sep 16 '09 at 14:23

JTidy should let you do what you want.

Usage is fairly straightforward, and parsing is configurable, e.g.:

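import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.tidy.Tidy;

InputStream in = ...;
Tidy tidy = new Tidy();
// configure the Tidy instance as required
...
...
// parse the stream into a W3C DOM; passing null means no tidied output is written
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();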

The JavaDoc is hosted here.
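Once you have the root Element, walking the tree as in the question's loop only needs the standard org.w3c.dom API. Here is a rough sketch that collects every tag name depth-first. DomWalker and tagNames are hypothetical names, and the small document is built by hand purely as a stand-in for whatever the parser returns.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DomWalker {
    /** Returns the tag names of root and all its descendant elements, in document order. */
    static List<String> tagNames(Element root) {
        List<String> names = new ArrayList<>();
        collect(root, names);
        return names;
    }

    private static void collect(Element element, List<String> names) {
        names.add(element.getTagName());
        // Children can also be text or comment nodes, so only recurse into elements.
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, names);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hand-built stand-in for the Document a parser would produce.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("hello");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(p);
        System.out.println(tagNames(doc.getDocumentElement())); // prints [html, body, p]
    }
}
```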

edited Sep 16 '09 at 14:48

answered Sep 16 '09 at 14:23

Andy

score 1 · Answer 3 · answered Sep 16 '09 at 14:19

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or non-valid XML) file.

It is distributed under the Apache 2.0 license.

Guido

score 0 · Answer 4 · answered Sep 16 '09 at 14:20

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
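For the second step, the "usual Java toolchain" is presumably the JAXP DocumentBuilder that ships with the JDK. A minimal sketch, assuming the HTML has already been converted to well-formed XML (the input string, the class name XmlToDom and the method parse are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlToDom {
    /** Parses a well-formed XML string into a DOM Document; malformed input throws a SAXException. */
    static Document parse(String wellFormedXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(wellFormedXml)));
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only; in practice this would be the converter's output.
        Document doc = parse("<html><body><p>one</p><p>two</p></body></html>");
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            System.out.println(paragraphs.item(i).getTextContent());
        }
    }
}
```

Note that this parser is strict: raw HTML with unclosed tags would be rejected, which is why the conversion step comes first.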

Johannes Weiss

score 0 · Answer 5 · edited May 23 '17 at 10:27

There are several open source tools to parse HTML from Java.

Check http://java-source.net/open-source/html-parsers

You can also check the answers to this question: "Reading HTML file to DOM tree using Java". It is almost the same.

edited May 23 '17 at 10:27

Community

answered Sep 16 '09 at 14:21

JuanZe
