Parsing XHTML with Java

Question

I need a little guidance with reading an XHTML page from a URL in Java.

Here's my best attempt at printing a specific string:

    try {
        URL item = new URL("url");
        URLConnection connect = item.openConnection();
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(connect.getInputStream());
        doc.getDocumentElement().normalize();
        NodeList nList = doc.getElementsByTagName("tag");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element el = (Element) nNode;
                System.out.println(el.getElementsByTagName("wantedElement").item(0).getTextContent());
            }
        }
    } catch (IOException | ParserConfigurationException | SAXException e) {
        e.printStackTrace();
    }

Response from Eclipse:

    [Fatal Error] :1:1: Content is not allowed in prolog.
    org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

XHTML sample I'm trying to parse (from TD Ameritrade API):

    <CandleList>
      <candles>
        <candles>
          <open>45.97</open>
          <high>46.26</high>
          <low>45.8</low>
          <close>46.0</close>
          <volume>7176781</volume>
          <datetime>1496293200000</datetime>
        </candles>
        <candles>
          <open>46.22</open>
          <high>46.86</high>
          <low>45.9</low>
          <close>46.8</close>
          <volume>9523927</volume>
          <datetime>1496379600000</datetime>
        </candles>

I appreciate any help!

If your XML's indentation is as messy as your code snippet's is, I have a clue about what causes the error :) — Jan B., Jun 06 '18 at 00:37
I wouldn't try to parse XHTML pulled from some web site with the DOM API. Use jsoup instead. It's more forgiving. — jingx, Jun 06 '18 at 00:40
Posted text I'm trying to parse - sorry. Will try JSOUP. Thanks — Robert, Jun 06 '18 at 00:51
What you’re parsing is not XHTML but XML. And you need to post the whole response including the first few characters because that’s where the problem is (you also need to post the end because the xml you’ve posted is invalid because it doesn’t close all tags) — Erwin Bolwidt, Jun 06 '18 at 01:03
If you try to validate this *xml* using https://www.xmlvalidation.com/ it also fails — Scary Wombat, Jun 06 '18 at 02:44
I'm interested in the `catch(IOException | ParserConfigurationException | SAXException e)` bit. Can Java really do that? Maybe I should have kept up with the versions... — Mr Lister, Jun 06 '18 at 10:29

score 0 · Accepted Answer · answered Jun 06 '18 at 23:52

While the question has all the issues mentioned in the comments, the error at line 1, column 1 is about a BOM (byte order mark) at the beginning of the stream.

Some services, especially .NET services, send a BOM at the beginning of the stream to mark the encoding (UTF-8, UTF-16LE, etc.).

See also: Byte order mark screws up file reading in Java
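
If a BOM does turn out to be the culprit, one way around it is to drop those bytes before the parser sees the stream. Below is a minimal sketch, assuming the response starts with the UTF-8 BOM (bytes EF BB BF). The class and method names (`BomSkippingParser`, `skipUtf8Bom`, `parse`) are made up for illustration, and BOMs of other encodings such as UTF-16 are not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class BomSkippingParser {

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF).
     * If the stream does not start with that BOM, nothing is skipped.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer.
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream ended early
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: push the bytes back so the caller still sees them.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Parses XML from the stream after discarding a UTF-8 BOM, if any. */
    static Document parse(InputStream in)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(skipUtf8Bom(in));
    }
}
```

In the question's code, this would mean calling `BomSkippingParser.parse(connect.getInputStream())` instead of `dBuilder.parse(connect.getInputStream())`. If the error persists, it is worth printing the first few bytes of the response to see what the server actually sends.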
