
Is this valid XML data (the value of the messageContent in particular)?

I am getting it from an API.

I then get an error when I pass this XML down to a Postgres function for saving to the Postgres DB.

<rows>

<row messageDateUTC="2020-06-01T21:20:37.120" 

texterAddress="" texterStreet="" messageContent="Hey beautiful it&apos;s Scott!&#55357;&#56842;"  />


</rows>

I wonder if it's an API issue, or a problem with the client-side module which generates the XML, or maybe Postgres has an issue and is not able to handle these characters.

Error here:

Caused by: org.postgresql.util.PSQLException: ERROR: invalid XML content
  Detail: line 5: xmlParseCharRef: invalid xmlChar value 55357
ddress="" texterStreet="" messageContent="Hey beautiful it&apos;s Scott!&#55357;
                                                                               ^
line 5: xmlParseCharRef: invalid xmlChar value 56842
" texterStreet="" messageContent="Hey beautiful it&apos;s Scott!&#55357;&#56842;
                                                                               ^
line 23: chunk is not well balanced
peter.petrov
  • Please see also: https://stackoverflow.com/questions/63133697/java-read-utf-8-file-with-a-single-emoji-symbol – peter.petrov Jul 28 '20 at 12:14
  • Also check [this answer](https://stackoverflow.com/a/20805244) regarding the choice of XML 1.0 or XML 1.1. – jbatista Mar 18 '21 at 09:17

1 Answer

tl;dr: No, they are not valid. Whatever produced those character references is either buggy or was given wrong encoding information about its input.

55357 and 56842 are 0xD83D and 0xDE0A in hex respectively.

In Unicode they are in ranges called "High Surrogate" and "Low Surrogate" respectively.

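That means that they do not represent characters on their own. UTF-16 uses them in pairs to encode a single Unicode code point that doesn't fit into 16 bits (i.e. one outside the Basic Multilingual Plane). XML excludes the whole surrogate range (U+D800 to U+DFFF) from its allowed characters, which is why the parser rejects both references.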

These two specific values decode to U+1F60A SMILING FACE WITH SMILING EYES. The correct decimal character reference for that would be &#128522;.
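The arithmetic behind that decoding can be checked with a few lines of Java. This is only a minimal sketch; the class name `SurrogateDecode` and the method `combine` are made up for illustration, and the JDK's own `Character.toCodePoint` performs the same calculation.

```java
// Hypothetical helper, for illustration only.
final class SurrogateDecode {

    /** Combines a UTF-16 high/low surrogate pair into one Unicode code point. */
    static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a high/low surrogate pair");
        }
        // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

Calling `SurrogateDecode.combine(55357, 56842)` returns 128522, which is 0x1F60A, the same value as `Character.toCodePoint((char) 55357, (char) 56842)`.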

The most likely cause is that the encoding was done by some transformation that either doesn't know about UTF-16 or assumed the text was not UTF-16. Even in that case it should have detected that those values are invalid and reported an error.
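One way such output can arise in Java is an escaper that walks a `String` one `char` (UTF-16 code unit) at a time instead of one code point at a time. The code that generated the XML is not shown in the question, so this is an assumption about the cause, not a diagnosis. The sketch below contrasts the two approaches; `NumericEscaper` and its methods are hypothetical names, and it only handles non-ASCII characters (a real escaper would also have to escape `&`, `<` and quotes).

```java
// Hypothetical escaper, for illustration only.
final class NumericEscaper {

    /** Buggy: iterates over UTF-16 code units, so an emoji becomes two surrogate references. */
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    /** Correct: iterates over code points, so an emoji becomes a single reference. */
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

For the input `"Scott!\uD83D\uDE0A"`, `escapePerChar` yields `Scott!&#55357;&#56842;` (the malformed form from the question), while `escapePerCodePoint` yields `Scott!&#128522;`.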

Joachim Sauer
  • Hm... Yes, I noticed this is some smile/emoji symbol. Thanks a lot... I will do some more research then to see who produces these, it's kind of hard to trace but let's see. – peter.petrov Jul 28 '20 at 10:18
  • The XML response from the API says encoding UTF-8... but seems to me the emoji is encoded in UTF-16. How can I verify that? In other words, how should that emoji have been encoded in UTF-8 (which the response claims its encoding is)? – peter.petrov Jul 28 '20 at 10:33
  • It's possible that the API already provides wrong data. A common mistake is to encode UTF-16 encoded data again as UTF-8, but treating it as if it was UCS-2 (which is the fixed-width 2-byte encoding that was prevalent before UTF-16 got widespread adoption). If that is the case then you should see the bytes `0xED 0xA0 0xBD` in the byte stream from the API. The best approach would be to fix the API, but you *might* be able to post-process your data to fix this issue (since it should not lose any information). – Joachim Sauer Jul 28 '20 at 10:40
  • Actually, if the API already provides XML, then it's possible that those bytes don't show up and it reports the text exactly as you posted above: as ASCII-compatible characters forming decimal XML character references (i.e. `&#55357;&#56842;`). If that shows up then technically the API already delivers malformed XML. If either that *or* the bytes mentioned above show up in the API then the API is at fault. – Joachim Sauer Jul 28 '20 at 10:42
  • No, the API provides a different XML, this one which I presented here - it's an XML format which I form from the API's raw format. You see... this response goes through multiple layers so it's tricky. E.g. I use Jersey on the client side to read the XML. Is it possible I need to configure Jersey somehow? I doubt it because the raw XML from the API seems to say UTF-8. So it must be UTF-8. The raw XML that I get from the API - I am not sure even how to see what raw bytes it has there for that emoji. I will think some more. – peter.petrov Jul 28 '20 at 10:50
  • The API is not in my control, it's a 3rd party one. You mentioned something about post-processing the data to fix this issue. Any ideas how can I do this? – peter.petrov Jul 28 '20 at 11:00
  • Seems the API is sending these raw bytes for that emoji symbol: F0 9F 98 8A. Is this in UTF-8? I think not and that's the problem probably. Can one even encode this emoji symbol in UTF-8? I am quite confused now. – peter.petrov Jul 28 '20 at 11:27
  • Oh, that's exactly the UTF-8 sequence of this symbol! Thanks. So the API is sending me good data. Right? – peter.petrov Jul 28 '20 at 11:30
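The two checks discussed in the comments above can be sketched in Java. `utf8Hex` prints the UTF-8 bytes of a string, which confirms that `F0 9F 98 8A` is the UTF-8 form of U+1F60A. `repair` is a possible stopgap for the post-processing idea: it rewrites adjacent decimal references that form a high/low surrogate pair into one valid reference. The class and method names are hypothetical, hexadecimal references such as `&#xD83D;` are not handled, and fixing the code that generates the references remains the cleaner solution.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SurrogateRefRepair {

    // Surrogates are 55296..57343 in decimal, so both references have exactly five digits.
    private static final Pattern PAIR = Pattern.compile("&#(\\d{5});&#(\\d{5});");

    /** Replaces each decimal surrogate-pair reference with a single code point reference. */
    static String repair(String xml) {
        Matcher m = PAIR.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the text already copied to out
        int pos = 0;  // where the next search starts
        while (m.find(pos)) {
            int high = Integer.parseInt(m.group(1));
            int low = Integer.parseInt(m.group(2));
            if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF) {
                out.append(xml, last, m.start());
                int codePoint = Character.toCodePoint((char) high, (char) low);
                out.append("&#").append(codePoint).append(';');
                last = m.end();
                pos = m.end();
            } else {
                // Not a pair: retry one character later so an overlapping pair is still found.
                pos = m.start() + 1;
            }
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    /** Returns the UTF-8 bytes of s as space-separated uppercase hex. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }
}
```

With these definitions, `SurrogateRefRepair.utf8Hex("\uD83D\uDE0A")` returns `F0 9F 98 8A`, and `SurrogateRefRepair.repair("Scott!&#55357;&#56842;")` returns `Scott!&#128522;`.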