SOLUTION This turned out not to be an XML issue at all. My XML escapes were done properly; the real problem was an encoding issue. I would like to share my solution with everyone, and I hope you find it useful.

public static String entityEncode(String text) throws UnsupportedEncodingException {
    if (text == null) {
        return null;
    }
    // Recover the raw bytes: ISO-8859-1 maps each char in 0-255 to exactly one byte.
    byte[] rawBytes = text.getBytes("ISO-8859-1");
    // Decode those bytes as UTF-8.
    String decoded = new String(rawBytes, "UTF-8");
    // XMLStringUtil is my own helper, based on StringEscapeUtils.escapeXml.
    return XMLStringUtil.escapeControlChrs(decoded);
}

EXPLANATION The XML function above targets XML 1.0. A Java String holds characters, not bytes in a particular encoding, so we first convert the given text to a byte array using ISO-8859-1. We then create a new String from those bytes, decoding them as UTF-8. The wrong decoding is also why Java only reported the character reference error with &# and could not name the character at fault. Now that the text is decoded as UTF-8, there are no issues and the XML escape proceeds properly!
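The round trip described above can be tried in isolation. This is only a minimal sketch: the class name MojibakeDemo and the method repair are made up for illustration, and it assumes the text really was UTF-8 that had been decoded as ISO-8859-1. Applying it to text that was decoded correctly would damage any non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class MojibakeDemo {
    // Re-encode the mis-decoded text to its original bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        // Simulate the bug: UTF-8 bytes decoded as ISO-8859-1.
        String misdecoded = new String(
            original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length());                  // prints 5
        System.out.println(repair(misdecoded).equals(original));  // prints true
    }
}
```

Using the StandardCharsets constants instead of charset names also avoids the checked UnsupportedEncodingException.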

EDIT: How do I print out all illegal XML characters in the provided string, according to the rules StringEscapeUtils.escapeXml applies? The problem is that I don't want to escape everything, because the text doesn't decode properly afterwards. So right now I just need to find out which characters in the text are invalid: the ones that are causing issues and need to be encoded.

I have the following error message:

ERROR:  'Character reference "&#'
ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Character reference "&#'

It does not tell me which character is at fault, which is a problem.
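One way to see which characters are at fault is to walk the string by code point and report everything outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The sketch below is only an illustration; the class XmlCharReport and its methods are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for listing characters that XML 1.0 does not allow.
class XmlCharReport {
    // True if the code point is in the XML 1.0 "Char" production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per illegal code point, with its char index in the string.
    static List<String> findIllegal(String text) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXml10(cp)) {
                report.add(String.format(Locale.ROOT, "index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return report;
    }
}
```

For example, XmlCharReport.findIllegal("ok\u0000bad\u001Fx") returns [index 2: U+0000, index 6: U+001F]. An unpaired surrogate shows up as a code point in the D800-DFFF range, which is also illegal.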

I do my original XML parse to convert the text to an XML document, and after that I sanitize further to remove illegal characters:

String xml10pattern = "[^"
    + "\u0009\r\n"
    + "\u0020-\uD7FF"
    + "\uE000-\uFFFD"
    + "\ud800\udc00-\udbff\udfff"
    + "]";

However, it's not removing them, so I'm not sure how to go about this. Currently I have:

String encoded = entityEncode(temp);
String legal = encoded.replaceAll(xml10pattern, "");
item.setResponseBody(legal);
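For comparison, here is a minimal sketch of the same kind of character class written with explicit code points and applied to raw text. The class name Xml10Sanitizer is hypothetical. One thing it makes visible: a pattern like this matches individual characters, so a numeric character reference that is already in the text (for example &#0;) passes through untouched, because &, #, 0 and ; are all legal characters on their own. Whether that is what happens in the code above depends on what the escaping step emits, which is an assumption here.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips code points that XML 1.0 does not allow.
class Xml10Sanitizer {
    // Negated class of the XML 1.0 Char ranges, using \x{...} code point escapes (Java 7+).
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    static String stripIllegal(String text) {
        return ILLEGAL_XML10.matcher(text).replaceAll("");
    }
}
```

With this sketch, stripIllegal("a\u0000b\u0001c") returns "abc", while stripIllegal("&#0;") returns the input unchanged.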

entityEncode just uses a standard XML utility class to escape characters: XMLStringUtil.escapeControlChrs, which is based on StringEscapeUtils.escapeXml and only adds extra escapes; nothing is removed. But something is being missed.

codeCompiler77