I am consuming a SOAP API in an application. I have some boilerplate code to process the API response into a Java object, like this:

// First, I remove the SOAP wrapper:
String soapResponse = this.callApiEndpointByPage(someIncrementingInt);
ByteArrayInputStream inputStream = new ByteArrayInputStream(soapResponse.getBytes());
SOAPMessage message = MessageFactory.newInstance(SOAPConstants.SOAP_1_2_PROTOCOL).createMessage(null, inputStream);
message.setProperty("Content-Type", "text/xml; charset=utf-8");
Document doc = message.getSOAPBody().extractContentAsDocument(); // <<--- Exception thrown here!

// Then I create an unmarshaller:
JAXBContext context = JAXBContext.newInstance(MyPojo.class);
Unmarshaller um = context.createUnmarshaller();

// Then I unmarshal the XML to a POJO:
MyPojo myPojo = (MyPojo) um.unmarshal(doc);

The API endpoint I'm hitting is paginated. For about 99 out of 100 pages, the above code works perfectly. However, when processing some pages, this exception is thrown:

Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.

Having investigated the SOAP responses more closely, it appears that some of the data contained in the XML is itself escaped XML. It looks something like this:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <SomeXMLParentObject xmlns="http://url-endpoint.com/webservices/">
            <SomeXMLChildObject>
                &lt;?xml version="1.0" encoding="utf-16"?&gt;
                &lt;Records count="50000"&gt;
                &lt;someEscapedDataINeedLater&gt;
                tonnes of escaped XML here
                &lt;/someEscapedDataINeedLater&gt;
            </SomeXMLChildObject>
        </SomeXMLParentObject>
    </soap:Body>
</soap:Envelope>

Note how the response encoding is UTF-8, but the escaped XML that it contains is UTF-16. All of the pages have this - but not all throw exceptions.
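For context, a minimal sketch of how the escaped inner document could be parsed later, assuming its text has already been pulled out of `SomeXMLChildObject` as a Java String (for example via `getTextContent()`). The class name `InnerXmlSketch`, the method `parseInnerXml`, and the sample record are hypothetical. When an XML parser reads from a character stream such as a `StringReader`, the characters are already decoded, so the inner `encoding="utf-16"` declaration should not need to match any byte encoding.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InnerXmlSketch {

    // Parses XML that is already held in a Java String (hypothetical helper).
    static Document parseInnerXml(String innerXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // trim(): the XML declaration must be the very first thing in the
            // document, and the unescaped text may start with whitespace.
            return builder.parse(new InputSource(new StringReader(innerXml.trim())));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse inner XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample shaped like the escaped payload above, after unescaping.
        String inner = "\n    <?xml version=\"1.0\" encoding=\"utf-16\"?>"
                + "<Records count=\"50000\">"
                + "<someEscapedDataINeedLater>data</someEscapedDataINeedLater>"
                + "</Records>";
        Document doc = parseInnerXml(inner);
        System.out.println(doc.getDocumentElement().getAttribute("count"));
    }
}
```

This only covers the inner document; it says nothing about why the outer envelope fails to parse.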

I suspect that the software which provides the API allows some seldom-used UTF-16 characters, and that those are causing the problems.

However, I cannot figure out how to force my code to expect UTF-16. No matter what I do, the error messages say that a "3-byte UTF-8 sequence" is expected.

In the code above, I explicitly state utf-8:

message.setProperty("Content-Type" ,"text/xml; charset=utf-8"); 

However, changing this to UTF-16 does nothing. Inspecting the SOAPMessage shows it is still expecting 'application/xml'.
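To illustrate what the error message is describing, here is a small sketch that feeds an XML parser a byte stream declared as UTF-8 but containing a broken 3-byte sequence. This is an assumption about the general mechanism, not a confirmed diagnosis of the failing pages. The class `MalformedUtf8Sketch`, its helpers, and the sample bytes are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class MalformedUtf8Sketch {

    // Returns true if the parser accepts the bytes as well-formed XML.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (IOException | SAXException e) {
            // A byte sequence that is not valid in the declared encoding ends up here.
            System.out.println(e);
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Wraps raw payload bytes in a tiny document that declares UTF-8.
    static byte[] wrap(byte[] payload) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</a>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(payload, 0, payload.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // E2 82 AC is the euro sign in UTF-8, a valid 3-byte sequence.
        byte[] valid = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        // Same first two bytes, but the third is not a continuation byte.
        byte[] broken = {(byte) 0xE2, (byte) 0x82, (byte) 0x41};
        System.out.println(parses(wrap(valid)));
        System.out.println(parses(wrap(broken)));
    }
}
```

If the bytes handed to the parser were produced with a charset other than the one the document declares, non-ASCII characters could plausibly yield sequences like the broken one here.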

Questions:

  • How can I make the code above expect UTF-16 instead of UTF-8?

  • Will that fix the exception I'm getting?

Edit: The solution in the possible duplicate question seems to be to change how the XML is produced - that does not apply in my case, as I am consuming an API and have no control over how the XML is formed.

Edit 2:

I figured out that I could set the character encoding when getting the bytes from the String, like this:

            ByteArrayInputStream inputStream = new ByteArrayInputStream(soapResponse.getBytes(Charsets.UTF_16));

However, this causes me to hit another issue:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Content is not allowed in prolog.

Googling around, that seems to be caused by the UTF-8 leading character:

"Another thing that often happens is a UTF-8 BOM (byte order mark), which is allowed before the XML declaration can be treated as whitespace if the document is handed as a stream of characters to an XML parser rather than as a stream of bytes." 

which leads me to believe that just changing the character encoding to UTF-16 for the entire app isn't the solution.
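As a rough illustration of what changing the charset does to the bytes, the sketch below prints the same short string encoded as UTF-8 and as UTF-16. The class `EncodingBytesSketch` and its `hex` helper are hypothetical. In Java, `StandardCharsets.UTF_16` encodes with a leading byte order mark (FE FF) and two bytes per character, so the resulting stream no longer matches an `encoding="utf-8"` declaration. This is not claimed to be the exact cause of the "Content is not allowed in prolog" error.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytesSketch {

    // Formats bytes as space-separated uppercase hex (hypothetical helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String start = "<a>";
        // One byte per ASCII character, no byte order mark.
        System.out.println(hex(start.getBytes(StandardCharsets.UTF_8)));
        // prints: 3C 61 3E

        // Byte order mark FE FF first, then two bytes per character.
        System.out.println(hex(start.getBytes(StandardCharsets.UTF_16)));
        // prints: FE FF 00 3C 00 61 00 3E
    }
}
```

One option worth testing under these assumptions is `soapResponse.getBytes(StandardCharsets.UTF_8)`, which keeps the bytes consistent with the declaration in the envelope, whereas the no-argument `getBytes()` uses the platform default charset.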

Any ideas how I can get this working for those pages with odd characters?

Paul
  • Possible duplicate of [MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence](https://stackoverflow.com/questions/9920758/malformedbytesequenceexception-invalid-byte-2-of-2-byte-utf-8-sequence) – Omar Himada Apr 06 '18 at 14:54
  • Hi @Adosi the solution in that question seems to be to modify the XML to produce it in a different way. I am consuming an API - I have no control over how the XML is formed. – Paul Apr 06 '18 at 15:17
  • Found any solution? – Hari Ram Mar 17 '20 at 10:54

0 Answers