HTML parsing using DOM-Java

Question

I want to parse an HTML file using Java, and I have used the DocumentBuilder class for it. My HTML contains an <img src="xyz"> tag without a closing </img> tag, which browsers allow. But when I give it to DocumentBuilder for parsing, it gives me this error:

The element type "img" must be terminated by the matching end-tag </img>.

Java :

DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(is);

What should I do to get rid of this error?

`The element type "img" must be terminated by the matching end-tag "</img>".` You probably need valid HTML to parse it. All tags must have an ending part, or at least be self-closed, as in `<img />` — Jakuje, Aug 11 '15 at 09:36
@Jakuje But HTML without a closing tag is valid. For example: http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_image_test — Vallabh Lakade, Aug 11 '15 at 09:38
libxml2 doesn't have this problem. It shuts up about the Official Rules and just parses that HTML, subject to varying levels of validation... — Phlip, Jul 11 '20 at 13:17

Wander Nauta · Answer 1 · 2015-08-11T09:40:59.437

5

DocumentBuilder is part of Java's XML parsing framework. An XML parser will not correctly parse HTML: the languages look similar, but XML has stricter requirements. (You've already seen one of the differences: in XML, every element must have a matching end tag or be written as a self-closing tag such as <img/>, while in HTML some elements, img among them, are written without an end tag.)
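
To make the difference concrete, here is a minimal sketch that uses only the JDK. The class name XmlVersusHtml and the helper parsesAsXml are names I made up for illustration. It feeds the same markup to DocumentBuilder twice, once with a bare <img> and once with a self-closed <img/>:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

public class XmlVersusHtml {

    // Hypothetical helper: true if the strict XML parser accepts the markup,
    // false if it rejects it as not well-formed.
    public static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors surface as SAXParseException, a subclass of SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML-style void element with no end tag: rejected by the XML parser.
        System.out.println(parsesAsXml("<html><body><img src=\"xyz\"></body></html>"));
        // The same element self-closed: accepted.
        System.out.println(parsesAsXml("<html><body><img src=\"xyz\"/></body></html>"));
    }
}
```

The first call should report false and the second true. Self-closing every void element by hand only works if you control the HTML, which is why a real HTML parser is usually the better fix.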

Try an HTML parser instead. I've heard good things about jsoup (http://jsoup.org/).
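
If you still want an org.w3c.dom.Document at the end, something along these lines might work. This is a sketch, assuming the jsoup jar is on the classpath and a jsoup version that ships org.jsoup.helper.W3CDom. The class name JsoupToDom and the method parseHtml are my own placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class JsoupToDom {

    // Hypothetical helper: parse lenient HTML with jsoup, then convert the
    // result to a standard W3C DOM document.
    public static Document parseHtml(String html) {
        org.jsoup.nodes.Document soup = Jsoup.parse(html);
        return new W3CDom().fromJsoup(soup);
    }

    public static void main(String[] args) {
        // The <img> has no end tag, exactly as in the question.
        Document dom = parseHtml("<html><body><img src=\"xyz\"></body></html>");
        NodeList images = dom.getElementsByTagName("img");
        Element first = (Element) images.item(0);
        System.out.println(first.getAttribute("src"));
    }
}
```

For a file on disk, Jsoup.parse(File, String) takes the file and a charset name instead of a string. If you do not need the W3C DOM specifically, jsoup's own Document with its select(...) CSS selectors may be enough.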

edited Aug 11 '15 at 09:40

answered Aug 11 '15 at 09:34

Wander Nauta

Thanks, I will try that. But are there any disadvantages of using jsoup? – Vallabh Lakade Aug 11 '15 at 09:41
I need to run XPath queries on the HTML. Just like (ahem) Gnome's libxml2 can do... – Phlip May 02 '20 at 04:04

score 0 · Answer 2 · answered Jan 19 '16 at 15:24

0

You can also use TagSoup to parse HTML as if it were XML, though that will give you SAX rather than DOM.
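
If you need a DOM anyway, one possible route is to pipe TagSoup's SAX events through the JDK's identity transformer into a DOMResult. This is only a sketch, assuming the TagSoup jar is on the classpath. The class name TagSoupToDom and the method parseHtml are invented for illustration:

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupToDom {

    // Hypothetical helper: TagSoup's Parser is a SAX XMLReader, so an identity
    // transform from a SAXSource to a DOMResult builds a DOM tree from its events.
    public static Document parseHtml(String html) throws Exception {
        XMLReader reader = new Parser();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, new InputSource(new StringReader(html))), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // The <img> has no end tag; TagSoup closes it while emitting SAX events.
        Document dom = parseHtml("<html><body><img src=\"xyz\"></body></html>");
        System.out.println(dom.getElementsByTagName("img").getLength());
    }
}
```

As far as I know, TagSoup reports elements in the XHTML namespace by default, so namespace-aware queries such as XPath may need a namespace context or the namespaces feature turned off.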

Elliotte Rusty Harold
