
This is an XML document (the sentence and whitespace prior to the XML declaration and XSLT processing instruction are part of the input):

This XML file does not appear to have any style information associated with it. The document tree is shown below.


    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
      <mts:meta name="elapsed-time" value="18" />
      <exchange-documents>
        <exchange-document country="US" number="8049504">
        ....
        ....
        ....

        </exchange-document>
      </exchange-documents>

I am parsing the XML and querying it with XPath. In most of the XML files, some text or whitespace precedes the XML declaration (see the XML above).

Without that leading text the file parses successfully, but if any text appears before the declaration, the parser fails with the error below:

--- exec-maven-plugin:1.2.1:exec (default-cli) @ XMLHandling ---

[Fatal Error] :1:1: Content is not allowed in prolog.

How can I get around this?

The code that I am using:

public static void main(String[] args) throws ParseException {

        String filePath = "D:/newxml.xml";

        try {
            FileInputStream file = new FileInputStream(new File(filePath));
            DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = builderFactory.newDocumentBuilder();
            Document xmlDocument = builder.parse(file);
            XPath xPath = XPathFactory.newInstance().newXPath();

            String pubOrPatentNumber = xPath.compile("//preference").evaluate(xmlDocument);
            ...
            ...
            }
            }

I can manually remove the text and execute, but I need to solve this within my code to clean up the input automatically.

Mads Hansen
Prabu
  • Most probably it is a byte order mark (BOM). See a possible solution here: http://stackoverflow.com/questions/21891578/removing-bom-characters-using-java – Artyom Rebrov Jul 25 '16 at 07:51
  • At the code level, you could use the string library functions: look for the first occurrence of "<?xml" in the input string containing the document, take the substring starting there, and parse that. However, I would advise proceeding with caution because of the well-formedness errors. It is an established best practice to make sure that XML documents are always well-formed, to avoid such issues. I hope this helps! – Ghislain Fourny Jul 25 '16 at 13:07
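The two comments above (a possible byte order mark, and cutting the input at the start of the XML declaration) can be combined into one cleanup step. The following is only a minimal sketch, assuming the whole response has already been read into a String; the class name `PrologCleaner` and the helpers `stripLeadingJunk` and `parse` are hypothetical names, not part of the original code. It fixes only the "Content is not allowed in prolog" error and does not repair the other well-formedness problems discussed in the answer below.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PrologCleaner {

    // Hypothetical helper: returns the input starting at the first "<?xml".
    // If there is no XML declaration, it falls back to the first '<'.
    // Everything before that point (BOM, whitespace, stray sentences) is dropped.
    static String stripLeadingJunk(String raw) {
        int start = raw.indexOf("<?xml");
        if (start < 0) {
            start = raw.indexOf('<');
        }
        return start < 0 ? raw : raw.substring(start);
    }

    // Hypothetical helper: cleans the input and hands it to the DOM parser.
    static Document parse(String raw) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String cleaned = stripLeadingJunk(raw);
        return builder.parse(new InputSource(new StringReader(cleaned)));
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption for illustration): BOM, a sentence,
        // and indentation in front of the XML declaration.
        String raw = "\uFEFFThis XML file does not appear to have any style information.\n\n"
                + "    <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<exchange-documents>"
                + "<exchange-document country=\"US\" number=\"8049504\"/>"
                + "</exchange-documents>";

        Document doc = parse(raw);
        System.out.println(doc.getDocumentElement().getNodeName());
        // prints "exchange-documents"
    }
}
```

Because the cleaned text is parsed from a character stream, the parser does not need to decode bytes itself; the bytes must be decoded with the correct charset when the response is first read into the String.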

1 Answer


There are two issues in the document from a well-formedness perspective.

  1. It is not allowed to have two top-level elements (mts:meta, exchange-documents).

  2. The prefix mts is not declared.

This amended document is well-formed (but one needs to adapt the namespace URI for mts, and to pick the appropriate name for the wrapping element):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<root>
    <mts:meta xmlns:mts="http://www.example.com" name="elapsed-time" value="18" />
    <exchange-documents>
        <exchange-document country="US" number="8049504">
            ....
            ....
            ....
        </exchange-document>
    </exchange-documents>
</root>
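
To check both points programmatically, one option is to parse with a namespace-aware `DocumentBuilder` and treat any `SAXException` as "not well-formed". This is a sketch under stated assumptions: the class `WellFormedCheck` and the method `isWellFormed` are hypothetical names, the sample strings are shortened stand-ins for the documents above, and `http://www.example.com` is the same placeholder namespace URI used in the amended document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Hypothetical helper: true if the parser accepts the text as a
    // (namespace-)well-formed XML document, false if it raises a SAXException.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // makes undeclared prefixes detectable
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the original: two top-level elements,
        // and the prefix mts is not declared.
        String original = "<mts:meta name=\"elapsed-time\" value=\"18\"/>"
                + "<exchange-documents/>";

        // Shortened stand-in for the amended document: one wrapping element,
        // prefix bound to a placeholder namespace URI.
        String amended = "<root>"
                + "<mts:meta xmlns:mts=\"http://www.example.com\""
                + " name=\"elapsed-time\" value=\"18\"/>"
                + "<exchange-documents/>"
                + "</root>";

        System.out.println(isWellFormed(original)); // prints "false"
        System.out.println(isWellFormed(amended));  // prints "true"
    }
}
```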
Adam Lear
Ghislain Fourny
  • The XML file comes over HTTP: I parse it on the fly by hitting the URL, opening a connection, and so on. In every XML file the first line is "This XML file does not appear to have any style information associated with it. The document tree is shown below." Because of that first line I am not able to parse the document, and I am not able to update the XML either. – Prabu Jul 25 '16 at 12:07
  • Thank you for getting back to me, Prabu, and sorry about that. I thought it was a copy-and-paste artefact. Then this is one more issue. Also, if this document is retrieved via HTTP, then something is wrong on the server serving this XML, unless it is supposed to be an XML fragment rather than a document. Is this sentence displayed by a browser? Browsers usually add bells and whistles when displaying XML. If so, can you try to look at, and share, the actual source code? Browsers normally let you view the raw XML. – Ghislain Fourny Jul 25 '16 at 12:55