Fatal Error while parsing web page with javax.xml.parsers.DocumentBuilder

Question

I'm writing a program which parses a web page (one which I don't have access to so I can't modify it).

First I connect and use getContent() to get an InputStream for the page. There's no trouble there.

But then when parsing:

    public static int[] parseMoveGameList(InputStream is) throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(is);
        /*...*/
    }

Here builder.parse throws:

org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 64; The system identifier must begin with either a single or double quote character.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:253)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:288)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at cs.ualberta.lgadapter.LGAdapter.parseMoveGameList(LGAdapter.java:78)
    ...

The page that I'm parsing (but can't change) looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >









<html>
<head>
<META http-equiv="Expires" content="0" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<!-- ...  -->
</head>
<body>
<!-- ...  -->
</body>
</html>

How can I get past this exception?

I don't think it's a good idea to use an XML parser to parse HTML. — Alex, Aug 10 '12 at 17:01
http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java — Alex, Aug 10 '12 at 17:07

score 2 · Accepted Answer · edited Apr 07 '13 at 05:43

HTML is not necessarily valid XML. Using an XML parser to parse HTML will probably produce a lot of errors (as you have already discovered).

Your HTML is failing because of its DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >

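XML parsers expect a 'PUBLIC' DOCTYPE declaration to carry a quoted system identifier (a fallback path or URL to the DTD) after the public identifier, which is what the error message is complaining about. It should look like the following: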

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "FALLBACK PATH TO DTD" >

If you can't change the HTML page, I am not sure there is much you can do about this. Maybe you can modify or wrap your input stream, either to add some dummy data so that it conforms to what is expected, or to remove the DOCTYPE declaration.
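The "remove the DOCTYPE declaration" idea could be sketched roughly as follows. This is only a minimal sketch, not a tested fix for your page: the class name DoctypeStripper and the method parseWithoutDoctype are made up for illustration, it assumes the page is UTF-8 (as its Content-Type meta tag says), and it assumes Java 9+ for readAllBytes().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is not from the original question.
class DoctypeStripper {

    // Reads the whole page, drops the first DOCTYPE declaration,
    // then hands the remaining text to the XML parser.
    static Document parseWithoutDoctype(InputStream is)
            throws ParserConfigurationException, IOException, SAXException {
        // Assumption: the page is UTF-8. readAllBytes() needs Java 9+.
        String page = new String(is.readAllBytes(), StandardCharsets.UTF_8);

        // Simple pattern: good enough for a DOCTYPE without an internal subset,
        // because it stops at the first '>' after "<!DOCTYPE".
        String stripped = page.replaceFirst("(?i)<!DOCTYPE[^>]*>", "");

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = dbf.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(stripped)));
    }
}
```

In parseMoveGameList you would then call DoctypeStripper.parseWithoutDoctype(is) instead of builder.parse(is). Keep in mind this only gets you past the DOCTYPE error. If the rest of the page is not well-formed XML (an unclosed tag, an unquoted attribute), the parser will stop at the next problem.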

You should use an HTML parsing library instead. I do not know of any off the top of my head, but this (older) post lists a couple: http://www.benmccann.com/blog/java-html-parsing-library-comparison/ . Searching Google also turns up http://jsoup.org/
