
You receive a string that may contain any kind of characters (UTF-8), including special characters like emoticons/emoji. You have to generate an XML element containing that string and pass it to an XSLT transformation engine.

Because I get transformation errors, I wonder how the Java code should process the string before inserting it into the final XML so that the XSLT transformation does not fail.

What I currently have in Java is this:

String inputValue = ...; // string received from an external client
Element target = ...; // element of the XML document that should carry the string
// Negated character class: matches everything that is NOT a legal XML 1.0 character
String xml10pattern = "[^"
                    + "\u0009\r\n"                // tab, CR, LF
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff" // U+10000 to U+10FFFF, written as surrogate pairs
                    + "]";
inputValue = inputValue.replaceAll(xml10pattern, ""); // remove the illegal characters
target.setAttribute("text", inputValue);
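
For reference, the same filter can be packaged as a small reusable helper. This is only a sketch: the class name Xml10Sanitizer and the method sanitize are made up for illustration, and the pattern uses the regex code-point notation \x{...} (available since Java 7) instead of surrogate pairs, which should match the same set of characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class Xml10Sanitizer {

    // Negated character class: matches every code point that is NOT
    // allowed by the XML 1.0 "Char" production.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^"
            + "\\x{9}\\x{A}\\x{D}"        // tab, LF, CR
            + "\\x{20}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}"    // supplementary planes (emoji live here)
            + "]");

    private Xml10Sanitizer() {
    }

    // Returns the input with all characters that are illegal in XML 1.0 removed.
    static String sanitize(String input) {
        return ILLEGAL_XML10.matcher(input).replaceAll("");
    }
}
```

Because Java regular expressions work on code points, a well-formed emoji (a valid surrogate pair) is kept, while control characters such as NUL and unpaired surrogates are dropped.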

Is anything still missing to make this safer?

basZero

2 Answers


The Apache Commons Lang library has StringEscapeUtils.escapeXml(String). It escapes markup characters, which lets you have an & in your attribute.
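
One thing worth checking before escaping by hand: when the value is set through the DOM API, as with setAttribute in the question, the DOM stores the raw text and markup characters are escaped when the document is serialized, so a pre-escaped value would likely end up double-escaped. A minimal sketch using only JDK classes follows; the class name AttributeEscapingDemo and the method serializeWithAttribute are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
final class AttributeEscapingDemo {

    private AttributeEscapingDemo() {
    }

    // Builds <root text="..."/> via DOM and serializes it with an identity transform.
    static String serializeWithAttribute(String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("root");
            root.setAttribute("text", value); // raw, unescaped value
            doc.appendChild(root);

            // A Transformer without a stylesheet copies the input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }
}
```

Calling serializeWithAttribute("a & b") should yield markup in which the ampersand appears as &amp;amp; without any manual escaping. Escaping does not help with characters that are illegal in XML 1.0, though; those still have to be removed.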

Joop Eggen
    No, it removes only illegal XML characters; see the "NOT" (^) at the beginning... I took it from http://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java/4237934#4237934 – basZero Apr 27 '16 at 11:01

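A cheap option would be to additionally strip every character above U+00FF, so that only plain text (including line breaks etc.) is passed on. Note that the second pattern keeps the range \x00-\xFF (Latin-1), which is slightly more than pure ASCII (\x00-\x7F):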

String inputValue = ...; // string received from an external client
Element target = ...; // element of the XML document that should carry the string
// Negated character class: matches everything that is NOT a legal XML 1.0 character
String xml10pattern = "[^"
                    + "\u0009\r\n"                // tab, CR, LF
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff" // U+10000 to U+10FFFF, written as surrogate pairs
                    + "]";
inputValue = inputValue.replaceAll(xml10pattern, ""); // remove the illegal characters
inputValue = inputValue.replaceAll("[^\\x00-\\xFF]", ""); // remove everything above U+00FF (emoji included)
target.setAttribute("text", inputValue);
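
To make the effect of the extra range filter concrete, here is a small sketch comparing a Latin-1 cut-off with a strict ASCII cut-off. The class name RangeFilterDemo and both method names are made up for illustration; only String.replaceAll from the JDK is used.

```java
// Hypothetical demo class, not part of any library.
final class RangeFilterDemo {

    private RangeFilterDemo() {
    }

    // Keeps U+0000 to U+00FF (Latin-1); accented Western European letters survive.
    static String keepLatin1(String s) {
        return s.replaceAll("[^\\x00-\\xFF]", "");
    }

    // Keeps U+0000 to U+007F (ASCII) only; accented letters are dropped too.
    static String keepAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

Both variants remove emoji, because every emoji code point lies far above either limit. Neither variant removes control characters below U+0020, so the XML 1.0 filter is still needed in addition.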

Any thoughts on this?

basZero