
I want to parse an HTML file using Java, and I have used the DocumentBuilder class for it. My HTML contains an <img src="xyz"> tag without a closing </img> tag, which browsers allow. But when I pass it to DocumentBuilder for parsing, I get this error:

The element type "img" must be terminated by the matching end-tag </img>.

Java:

DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(is);

What should I do to get rid of this error?

Yagnesh Agola
Vallabh Lakade
  • `The element type "img" must be terminated by the matching end-tag "</img>".` You probably need valid html to parse it. All tags must have ending part, or at least be defined as `<img />` – Jakuje Aug 11 '15 at 09:36
  • HTML *isn't* XML and isn't subject to the same validation – Brian Agnew Aug 11 '15 at 09:38
  • @Jakuje but an <img> without a closing tag is valid HTML. For example: http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_image_test – Vallabh Lakade Aug 11 '15 at 09:38
  • libxml2 doesn't have this problem. It shuts up about the Official Rules and just parses that HTML, subject to varying levels of validation... – Phlip Jul 11 '20 at 13:17

2 Answers


DocumentBuilder is part of Java's XML parsing framework. An XML parser will not correctly parse HTML: the languages look similar, but XML has stricter requirements. (You've already seen one of the differences: in XML, every element must be closed, either with a matching end tag or with the self-closing form such as <img src="xyz"/>, while in HTML some elements, <img> among them, never take an end tag.)
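
To see the difference for yourself, here is a minimal sketch using only the JDK. The class name `XmlStrictnessDemo` and the helper `isWellFormedXml` are made up for this illustration; the sample markup is the `<img src="xyz">` from the question wrapped in a `<p>`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
final class XmlStrictnessDemo {

    // Returns true if the JDK's XML parser accepts the markup as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                    markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because of an unclosed element.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not what this demo is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but not well-formed XML: <img> is never closed.
        System.out.println(isWellFormedXml("<p><img src=\"xyz\"></p>"));  // prints false
        // The self-closing form satisfies the XML parser.
        System.out.println(isWellFormedXml("<p><img src=\"xyz\"/></p>")); // prints true
    }
}
```

Rewriting every `<img>` by hand only helps if you control the HTML; for arbitrary pages a real HTML parser is the more robust route.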

Try an HTML parser instead. I've heard good things about jsoup (http://jsoup.org/).

Wander Nauta

You can also use TagSoup to parse HTML as if it were XML, though that will give you SAX rather than DOM.
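
If you do need a DOM, one common way to get from SAX events to a `Document` is an identity `Transformer` writing into a `DOMResult`. The sketch below uses only the JDK so that it runs as-is: `jdkReader()` is a stand-in that returns the JDK's own strict `XMLReader`, and the class and method names (`SaxToDomDemo`, `toDom`, `firstImgSrc`) are made up for this illustration. TagSoup's `org.ccil.cowan.tagsoup.Parser` implements `XMLReader`, so the assumption is that it could be passed to `toDom` in place of the stand-in; that part is not exercised here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical demo class, not part of any library.
final class SaxToDomDemo {

    // Builds a DOM tree from whatever SAX events the given reader emits.
    static Document toDom(XMLReader reader, String markup) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            identity.transform(
                    new SAXSource(reader, new InputSource(new StringReader(markup))),
                    result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-in reader: the JDK's strict XML parser. A lenient reader such as
    // TagSoup's Parser would be substituted here (assumption, not shown).
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the src attribute of the first img element in the markup.
    static String firstImgSrc(String markup) {
        Document doc = toDom(jdkReader(), markup);
        Element img = (Element) doc.getElementsByTagName("img").item(0);
        return img.getAttribute("src");
    }

    public static void main(String[] args) {
        // Well-formed input, because the stand-in reader is still a strict XML parser.
        System.out.println(firstImgSrc("<p><img src=\"xyz\"/></p>")); // prints xyz
    }
}
```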