3

I have an old Java application that processes XML from a third-party data feed.

The data feed allows user-input, and it is now suddenly containing emojis such as �� (). I'm actually surprised it took this long for this problem to appear (emojis have been around for a few years now).

The app blows up in javax.xml.parsers.DocumentBuilder.parse(InputStream):

org.xml.sax.SAXParseException; lineNumber: 105; columnNumber: 3039; Character reference "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

Is there a quick, localized fix that I can apply without having to redesign and rearchitect the whole application? Also, would prefer to avoid a regex search/replace hack since that can introduce other subtle problems.

Alex R
  • 11,364
  • 15
  • 100
  • 180

1 Answers1

6

�� is a single character encoded as a surrogate pair (two surrogates). A character reference in XML cannot represent a (high or low) surrogate: these are not legal characters. The character reference should represent the Unicode codepoint of the Emoji as a whole, 👇.

The third party is sending you invalid XML, and you should reject it as you would reject any other faulty goods from a supplier.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
  • Hi Michael, I'm a fan and still have your XSLT book on my shelf! Can you explain the algorithm that yields 👇 from ? – Alex R Oct 29 '18 at 15:47
  • 1
    @AlexR see https://stackoverflow.com/questions/8868432/how-are-surrogate-pairs-calculated – Michael Kay Oct 29 '18 at 22:06