SOLUTION So this was not an xml issue at all. My xml escapes were done properly, however there was an encoding issue. So i would like to share my solution with everyone, i hope you find this useful.
public static String entityEncode(String text) throws UnsupportedEncodingException {
String result = text;
if (result == null) {
return result;
}
byte ptext[] = result.getBytes("ISO-8859-1");
String value = new String(ptext, "UTF-8");
String temp = XMLStringUtil.escapeControlChrs(value);
return temp;
}
EXPLANATION The xml function above is for XML 1.0. We take our given text, convert it into a byte since String does not have an associated encoding. After which we create a new string off of the byte in "UTF-8". That is also why java was just telling me that character reference error with &#, it couldn't recognize the character at fault. Now that I did the encoding and assigned it to UTF-8, there are no issues and the xml escape proceeds properly!
EDIT: How do i print out all illegal xml characters in the provided string? According to StringEscapeUtils.escapeXml
parameters? The problem i have is that i don't want to escape everything, because it doesn't properly decode after. So right now, i just need to find out what my invalid characters in the text are. The oens that are causing issues and need to be encoded.
I have the following error message:
ERROR: 'Character reference "&#'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Character reference "&#'
It does not specifically tell me what the character is which is a problem.
I do my original XML parse to convert to an xml document and then after that. I sanitize further to remove illegal characters
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
However, it's not removing them so i'm not sure how to go about this. Currently i have:
String temp = entityEncode(temp);
String legal = temp.replaceAll(xml10pattern , "");
item.setResponseBody(legal);
Entity encode just uses a standard xml parse class to escape characters XMLStringUtil.escapeControlChrs
which is based off of StringEscapeUtils.escapeXml
and just has additional escapes, nothing removed. But something is being missed.