1

I want to read the XML response below, but parsing it throws an error.

<html>
<head>
    <title>OK</title>
</head>
    <body>
    <h1>OK</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">200</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">Page created</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/parentnode/demopage" id="Location">/content/parentnode/demopage</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/content/parentnode" id="ParentLocation">/content/parentnode</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/parentnode/demopage</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/parentnode/demopage">Modified Resource</a></p>
    <p><a href="/content/parentnode">Parent of Modified Resource</a></p>
    </body>
</html>

I am trying to read the "Page created" message with the code below:

Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(response.toString())));

        NodeList nodes = doc.getElementsByTagName("div");
        if (nodes.getLength() > 0) {
            Element ele = (Element) nodes.item(0);
            System.out.println("Page created -"
                    + ele.getElementsByTagName("//div[contains(@id,'Message')]").item(0).getTextContent());
        } else {    
        }

[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at working.OkhttpCreatePage.main(OkhttpCreatePage.java:40)

Line number 40 is .parse(new InputSource(new StringReader(response.toString())));

What am I doing wrong?

paul
  • Does this answer your question? [org.xml.sax.SAXParseException: Content is not allowed in prolog](https://stackoverflow.com/questions/5138696/org-xml-sax-saxparseexception-content-is-not-allowed-in-prolog) – Curiosa Globunznik Nov 16 '20 at 09:00

2 Answers

2

The HTML you're parsing can be handled by the Java DOM parser, but that may be a happy coincidence: another HTML response could contain markup that is invalid from an XML point of view. If you're 100% sure the responses will always arrive as XML/XHTML, that shouldn't be a problem; otherwise it would make sense to switch to the jsoup parser, as suggested in another answer.

As for the Content is not allowed in prolog error, it can come from stray characters (a byte-order mark, leftover text, and so on) before the start of the actual XML document. You could try trimming the string before parsing it, or taking the substring from the first < character to the end.
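
The substring suggestion can be sketched as follows. This is only an illustration under assumptions: `PrologCleaner` and `stripToFirstTag` are hypothetical names, not part of any library, and the sample string simulates a response with a byte-order mark in front of the markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class PrologCleaner {

    // Drops everything before the first '<' so the parser sees markup at column 1.
    static String stripToFirstTag(String raw) {
        int start = raw.indexOf('<');
        if (start < 0) {
            throw new IllegalArgumentException("No markup found in input");
        }
        return raw.substring(start);
    }

    // Parses the cleaned string with the JDK's DOM parser.
    static Document parse(String raw) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(stripToFirstTag(raw))));
    }

    public static void main(String[] args) throws Exception {
        // Simulated response body: a byte-order mark and a newline before the root element.
        String raw = "\uFEFF\n<html><body><div id=\"Message\">Page created</div></body></html>";
        Document doc = parse(raw);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

If the cleaned string still fails to parse, the input probably does not contain the response body at all, so it is worth printing it before parsing.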

Also note that your XPath logic is not quite right: getElementsByTagName expects a tag name, not an XPath expression. Here is a corrected version:

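import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

    Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

    NodeList nodes = doc.getElementsByTagName("div");
    if (nodes.getLength() > 0) {
        Element ele = (Element) nodes.item(0);
        // XPath expressions go through the XPath API; "//" searches the whole document
        System.out.println("Page created - "
                + XPathFactory.newInstance().newXPath().evaluate("//div[contains(@id,'Message')]", ele));
    }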
Alexandra Dudkina
  • Also make sure all tags have matching closing tags – JCompetence Nov 16 '20 at 09:31
  • Still giving same error. I am using `.parse(new InputSource(new StringReader(response.toString())));` – paul Nov 16 '20 at 09:44
  • Did you trim the response? `String content = response.toString().trim();` ? – Alexandra Dudkina Nov 16 '20 at 09:55
  • Yes, I tried that too. Is it working for you? – paul Nov 16 '20 at 10:02
  • When I take your HTML from `<html>` to `</html>`, it parses successfully. I would check the content of the response in debug mode. It could contain some other characters at the beginning. – Alexandra Dudkina Nov 16 '20 at 10:11
  • Instead of using `response.toString()`, I tried `response.asString()` and the same code worked. I can't use the `Answer your question` button, but I'm leaving this here for anyone who sees this post and wonders what they might be doing wrong. – paul Nov 22 '20 at 19:26
0

An XML document usually starts with an XML declaration (optional in XML 1.0):

<?xml version="1.0" encoding="UTF-8"?>

and ends with the closing tag of its root element, for example

</root>

Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog

This error usually means one of the following:

  • Your document has spaces or other characters before <?xml
  • The document you are reading is encrypted/compressed, or contains characters that cannot be decoded with the default encoding.
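
The effect of non-markup characters at the start of the input can be tried out with the JDK's own parser. The sketch below is an illustration, not code from the question: `PrologDemo` and `parses` are hypothetical names, and the `Response{code=200}` prefix is a made-up stand-in for any text that is not markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only.
class PrologDemo {

    // Returns true if the JDK's DOM parser accepts the string as XML.
    static boolean parses(String text) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // Report why the parser rejected the input.
            System.out.println("Rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><div id=\"Message\">Page created</div></root>";
        System.out.println(parses(xml));                        // markup only
        System.out.println(parses("\uFEFF" + xml));             // byte-order mark first
        System.out.println(parses("Response{code=200}" + xml)); // plain text first
    }
}
```

Comparing the results for the three inputs shows whether the characters in front of the first tag are what the parser objects to.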

In your case, you are trying to parse an HTML document. It has markup elements, but it is not an XML document.
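
The difference shows up as soon as the HTML uses a construct that XML does not allow. A minimal sketch, assuming the JDK's built-in DOM parser; `HtmlVsXml` and `isWellFormedXml` are hypothetical names, and the two snippets are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only.
class HtmlVsXml {

    // Returns true if the markup is well-formed XML according to the JDK's DOM parser.
    static boolean isWellFormedXml(String markup) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Every tag is closed, so this HTML happens to be well-formed XML too.
        String closed = "<html><body><p>Page created</p></body></html>";
        // <br> without an end tag is valid HTML but not well-formed XML.
        String unclosed = "<html><body><p>Page<br>created</p></body></html>";

        System.out.println(isWellFormedXml(closed));
        System.out.println(isWellFormedXml(unclosed));
    }
}
```

An HTML parser accepts both snippets, which is why a dedicated HTML library is the safer choice when the response format is not guaranteed to be XHTML.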

I suggest you look into a library that handles such documents if you really want to read HTML.

To read an actual HTML document as a String:

https://jsoup.org/cookbook/introduction/parsing-a-document

To read the HTML page directly from the web or from a response:

https://www.baeldung.com/java-with-jsoup

JCompetence
  • In addition, valid HTML is not necessarily valid XML. Some HTML tags do not have a matching end tag. – sigur Nov 16 '20 at 09:16
  • My XML doesn't start with `<?xml ...?>`, but it is XML, so is there another way to read it? To remove spaces I used `response.body().toString().trim();` – paul Nov 16 '20 at 09:24