Jackson/Woodstox XML Encoded Character Interpretation

Question

I have been handed an XML file with instructions to read, edit, and write it using Jackson and Woodstox (as recommended in the documentation). For the most part this has not been too hard; they are both pretty darn good at what they do. At this point, though, I have run into a problem:

My XML objects themselves contain XML objects. For example:

<XMLObject>
    <OuterObject attributeOne="1" attributeTwo="2" attributeThree="&gt;">
        <InnerObject>&lt;NestedObject&gt;Blah&lt;/NestedObject&gt;</InnerObject>
    </OuterObject>
    <OuterObject attributeOne="11" attributeTwo="22" attributeThree="&lt;">
        <InnerObject>&lt;NestedObject&gt;Blah&lt;/NestedObject&gt;</InnerObject>
    </OuterObject>
    <OuterObject attributeOne="111" attributeTwo="222" attributeThree="3" />
</XMLObject>

The moment that I read the XML file into my Jackson-annotated Java object, all of those instances of &lt; and &gt; are converted by Woodstox into < and >, respectively. When I write the object back out as an XML file, < becomes &lt; but > stays >:

<XMLObject>
    <OuterObject attributeOne="1" attributeTwo="2" attributeThree=">">
        <InnerObject>&lt;NestedObject>Blah&lt;/NestedObject></InnerObject>
    </OuterObject>
    <OuterObject attributeOne="11" attributeTwo="22" attributeThree="&lt;">
        <InnerObject>&lt;NestedObject>Blah&lt;/NestedObject></InnerObject>
    </OuterObject>
    <OuterObject attributeOne="111" attributeTwo="222" attributeThree="3" />
</XMLObject>

The simplest version of my method that is endeavoring to read the file is as follows:

@RequestMapping("readXML")
public @ResponseBody CustomXMLObject readXML() throws Exception {
    File inputFile = new File(FILE_PATH);
    XmlMapper mapper = new XmlMapper();
    CustomXMLObject value = mapper.readValue(inputFile, CustomXMLObject.class);

    return value;
}

And my Jackson-annotated Java object would look something like this for the example that I gave above:

import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;

@JsonInclude(JsonInclude.Include.NON_NULL)
public class CustomXMLObject {
    @JacksonXmlProperty(isAttribute=true)
    private long attributeOne;
    @JacksonXmlProperty(isAttribute=true)
    private String attributeTwo;
    @JacksonXmlProperty(isAttribute=true)
    private String attributeThree;
    @JacksonXmlProperty(localName = "InnerObject")
    private String innerObject;


    public long getAttributeOne() {
        return attributeOne;
    }

    public void setAttributeOne(long attributeOne) {
        this.attributeOne = attributeOne;
    }

    public String getAttributeTwo() {
        return attributeTwo;
    }

    public void setAttributeTwo(String attributeTwo) {
        this.attributeTwo = attributeTwo;
    }

    public String getAttributeThree() {
        return attributeThree;
    }

    public void setAttributeThree(String attributeThree) {
        this.attributeThree = attributeThree;
    }

    public String getInnerObject() {
        return innerObject;
    }

    public void setInnerObject(String innerObject) {
        this.innerObject = innerObject;
    }
}

Finally, my dependencies look like this:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>com.jayway.jsonpath</groupId>
    <artifactId>json-path</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-jaxb-annotations</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.dataformat</groupId>
    <artifactId>jackson-dataformat-xml</artifactId>
    <version>2.8.4</version>
</dependency>
<dependency>
    <groupId>org.codehaus.woodstox</groupId>
    <artifactId>woodstox-core-asl</artifactId>
    <version>4.4.1</version>
</dependency>

This appears to be occurring due to Jackson's use of Woodstox' BufferingXmlWriter. This particular writer will intercept those characters and encode them, and there does not appear to be any way to circumvent that decision:

private final void writeAttrValue(String value, int len) throws IOException {
    int inPtr = 0;
    char qchar = this.mEncQuoteChar;
    int highChar = this.mEncHighChar;

    while(true) {
        String ent = null;

        while(true) {
            if(inPtr >= len) {
                return;
            }

            char c = value.charAt(inPtr++);
            if(c <= 60) {
                if(c < 32) {
                    if(c == 13) {
                        if(this.mEscapeCR) {
                            break;
                        }
                    } else {
                        if(c == 10 || c == 9 || this.mXml11 && c != 0) {
                            break;
                        }

                        c = this.handleInvalidChar(c);
                    }
                } else {
                    if(c == qchar) {
                        ent = this.mEncQuoteEntity;
                        break;
                    }

                    if(c == 60) {
                        ent = "&lt;";
                        break;
                    }

                    if(c == 38) {
                        ent = "&amp;";
                        break;
                    }
                }
            } else if(c >= highChar) {
                break;
            }

            if(this.mOutputPtr >= this.mOutputBufLen) {
                this.flushBuffer();
            }

            this.mOutputBuffer[this.mOutputPtr++] = c;
        }

        if(ent != null) {
            this.writeRaw(ent);
        } else {
            this.writeAsEntity(value.charAt(inPtr - 1));
        }
    }
}

To sum up the problem: I have been given an XML file. That XML file contains attributes and elements that themselves contain symbols (< and >) that have been encoded (&lt; and &gt;) so as not to break the XML. When Woodstox reads the file, instead of handing my Java object the literal string contained in the XML, it decodes the characters. Upon writing, only < is re-encoded as &lt;. This appears to be happening because Jackson is using Woodstox' BufferingXmlWriter, which does not seem to be configurable with respect to which characters it encodes.

As a result, my question is the following:

Can I configure the Jackson object to use a Woodstox XML reader that will allow me to read and write the characters in my XML file without further encoding, or do I need to look into a different solution entirely for my needs?

Just one more comment: the character `>` does NOT have to be escaped in XML, unless it is part of the sequence `]]` followed by `>`. So while it may be preferable to force it to always be escaped, that is not required by the XML specification, nor should any XML-aware tool care deeply whether `>` comes in escaped or not. So it'd be good to know what the original issue with lack of escaping was. – StaxMan, Oct 26 '18 at 02:06
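To make the comment above concrete, here is a minimal sketch that parses the same fragment twice, once with every `>` written as `&gt;` and once with `>` left raw, and compares what the parser hands back. It uses only the StAX parser bundled with the JDK rather than Woodstox or Jackson, and the class and method names (`GtEscapingDemo`, `readAttrAndText`) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Jackson or Woodstox.
final class GtEscapingDemo {

    /** Returns "<attributeThree value> | <all character data>" as decoded by the parser. */
    static String readAttrAndText(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String attr = null;
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "OuterObject".equals(reader.getLocalName())) {
                    // Attribute values arrive already decoded.
                    attr = reader.getAttributeValue(null, "attributeThree");
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    // Text may arrive in several chunks, so append each one.
                    text.append(reader.getText());
                }
            }
            reader.close();
            return attr + " | " + text;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every '>' escaped, as in the input file from the question.
        String escaped = "<OuterObject attributeThree=\"&gt;\"><InnerObject>"
                + "&lt;NestedObject&gt;Blah&lt;/NestedObject&gt;</InnerObject></OuterObject>";
        // '>' left raw, as in the output file from the question.
        String raw = "<OuterObject attributeThree=\">\"><InnerObject>"
                + "&lt;NestedObject>Blah&lt;/NestedObject></InnerObject></OuterObject>";

        System.out.println(readAttrAndText(escaped));
        System.out.println(readAttrAndText(raw));
        System.out.println(readAttrAndText(escaped).equals(readAttrAndText(raw))); // prints "true"
    }
}
```

Both documents decode to the same attribute value and the same text, which is why an XML-aware consumer should not notice the difference. A consumer that compares the file byte for byte, or that greps for `&gt;`, would.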

StaxMan · Answer 1 · 2016-11-14T15:44:34.490


You can configure the underlying XMLOutputFactory2 to use CharacterEscapes, which lets you override what gets escaped by default. Would this:

http://www.cowtowncoder.com/blog/archives/2012/08/entry_476.html

work?

EDIT: apologies for the suggestion above -- this does not work with XML, only with JSON. I should have double-checked it. While there is a work item to make it work with XML as well, that does not exist yet (as of November 2016).

edited Nov 14 '16 at 15:44

answered Nov 11 '16 at 16:09

StaxMan



Unfortunately, no. As per https://github.com/FasterXML/jackson-dataformat-xml/issues/75 the current build does not support the Character Escape feature. I have tried to implement that regardless to no avail. – Matthew Snook Nov 11 '16 at 18:06
@MatthewSnook yes and no: the current version does not support convenient access, but there is nothing preventing you from providing a specifically configured `XMLOutputFactory2` for `XmlMapper` (or maybe via `XmlFactory`, I forget the details). So there's a bit more wiring to do, but it should be doable. – StaxMan Nov 11 '16 at 19:22
Hm....well, then, I need a bit more guidance. Simply following the process from that blog post does not seem to actually do anything to the read/write data. – Matthew Snook Nov 11 '16 at 22:42
Ah, got pointed towards http://forum.spring.io/forum/spring-projects/web-services/53022-stax-endpoint-response-and-special-characters by the Jackson Google group; will be trying it out on Monday. Not quite the same as the Character Escapes, but along the same lines. – Matthew Snook Nov 12 '16 at 00:55
Hrm. Upon further inspection the text escape property exists only on the Output factory, and I need the data to remain unchanged at read for that to be entirely relevant. Lazy parsing sounded promising at first, but obviously didn't pan out. – Matthew Snook Nov 17 '16 at 00:31
@MatthewSnook right; XML is really not designed to be extractable, or to retain its exact representation. About the only way to guarantee identical output to input would be to use canonical XML output (serialization) -- there is a specification for that. But then again, such output usually requires a tree model (DOM) to produce, I think. Canonical XML is mostly used for security purposes, to calculate a signature/hash in order to ensure that a document's logical content has not been changed. – StaxMan Nov 17 '16 at 00:35

Bah, yeah, that's the impression that I am getting from all this. I suppose that what I want is technically possible, but essentially unfeasible. If I wrote my own XML reader I could force it to happen but frankly I don't have the time or the inclination. I have already implemented a stopgap measure of forcefully reinterpreting the relevant symbols on object creation via a replace in the setter for those specific fields. Crude, but it's gotten the job done. – Matthew Snook Nov 17 '16 at 17:49
@MatthewSnook right. I have to admit I do not quite understand your original problem: at logical level contents remain unchanged, and physical escaping details should not matter for any XML processor. But I assume there are some practical reasons for this to matter in your particular case. – StaxMan Nov 20 '16 at 05:07
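For anyone who wants to try the output-factory route discussed in the comments, the piece you would end up writing is an escaping `Writer`. Below is a minimal sketch of one that encodes every `&`, `<` and `>`, using only the JDK. The class name `FullEscapingWriter` and the helper `escape` are hypothetical. The wiring into Woodstox is not shown and is an assumption on my part: as far as I know it goes through the Stax2 `EscapingWriterFactory` interface, set on the output factory via the `XMLOutputFactory2.P_TEXT_ESCAPER` and `P_ATTR_VALUE_ESCAPER` properties, with that factory then handed to `XmlFactory`/`XmlMapper`. Check the Javadoc of your versions before relying on it.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: escapes '&', '<' and every '>' on the way out.
// For attribute values the quote character would need escaping as well; omitted here.
final class FullEscapingWriter extends FilterWriter {

    FullEscapingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        switch (c) {
            case '&': out.write("&amp;"); break;
            case '<': out.write("&lt;");  break;
            case '>': out.write("&gt;");  break;
            default:  out.write(c);
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        // Route every character through the single-char method above.
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }

    /** Convenience for trying the writer without any XML library involved. */
    static String escape(String raw) {
        StringWriter sink = new StringWriter();
        try (Writer w = new FullEscapingWriter(sink)) {
            w.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;NestedObject&gt;Blah&lt;/NestedObject&gt;
        System.out.println(escape("<NestedObject>Blah</NestedObject>"));
    }
}
```

Note that this only affects writing. As pointed out in the comments, the parser still decodes the entities on read, so the Java object holds `<` and `>` either way; the writer merely makes the serialized form match the input file again.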
