
I am trying to debug a weird issue, hoping a Unicode expert here would be able to help.

  • I have a (Perl based) sender program, which takes some data structure
  • it encodes the data structure into a proprietary serialized format which uses curly braces for encoding the data. Here's an example serialized string: {{9}{{8}{{skip_association}{{0}{}}}{{data}{{9}{{1}{{exceptions}{{9}{{1}{{-472926}{{9}{{1}{{AAAAAAYQ2}
  • it then sends that serialized string to a Java server
  • the Java server then tries to de-serialize the string back into a data structure.
  • The encoding does not really matter too much (imho) other than that it uses the field length as part of the encoded data; e.g. {{id}{{7}9{Z928D2AA2}}} means a field named "id", of type "string" (7), string length 9, value Z928D2AA2 (see the sketch just below).
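
To illustrate the length aspect, here is a minimal Perl sketch; serialize_string_field is a hypothetical stand-in for the proprietary encoder (not the real code). The point is only that length() counts characters on a decoded string but bytes on an encoded one, so a character like 0x82 (1 character, 2 bytes in UTF-8) can make the declared length disagree with what the receiver sees on the wire.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper mimicking the shape shown above:
# {{id}{{7}9{Z928D2AA2}}} -- field name, type 7 ("string"), length, value.
# This is NOT the real serializer, just an illustration of the length pitfall.
sub serialize_string_field {
    my ($name, $value) = @_;
    return sprintf '{{%s}{{7}%d{%s}}}', $name, length($value), $value;
}

print serialize_string_field('id', 'Z928D2AA2'), "\n";
# {{id}{{7}9{Z928D2AA2}}}

# U+0082 is 1 character, but 2 bytes once UTF-8 encoded:
my $char = "\x{0082}";
printf "chars: %d, bytes: %d\n",
    length($char), length(encode('UTF-8', $char));    # chars: 1, bytes: 2
```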

Problem: When the data structure being serialized contains some specific Unicode character(s), the de-serialization fails.

Specifically, this character: "" (which various online decoders display as %82 or 0x82) causes the issue.

I'm trying to understand why this would be an issue and what's so special about this character - there are other Unicode characters that do not break the de-serializer.

Is there something special about this character (aka 0x82) that would interfere with parsing a serialized string that relies on curly braces as separators and on field lengths being known?

Unfortunately, I am unable to debug the decoding library, so I only get a generic error message that decoding failed, without any indication of what failed.

P.P.S. Double extra curious: when I used that character in the title of the SO question, it showed up in the preview but got deleted when the question was posted!!! When I copied/pasted the strings into the editor, their measured length was correct compared to the encoded string length.

P.S. The Perl code doing the serialization is, as far as I know, fully Unicode compliant:

use open      qw(:std :utf8);    # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16
use Encode qw(decode);
DVK
  • It's really impossible to say without knowing anything about the serialization format or implementation. – Grinnz Jun 12 '19 at 22:02
  • @Grinnz - I'm hoping this Unicode character is something special (like, equivalent to a closing curly brace or something; or has weird length calculations) – DVK Jun 12 '19 at 22:04
  • The only thing special about this character vs other Unicode characters is that it can be represented in cp1252 (the native single byte encoding of most US systems). – Grinnz Jun 12 '19 at 22:06
  • "fully Unicode compliant" is not really a phrase that makes sense. You must encode and decode your data exactly where appropriate and nowhere else, there is no magic Unicode compliance feature. If you are using these characters literally in your source code, you need `use utf8;`. – Grinnz Jun 12 '19 at 22:07
  • See also https://stackoverflow.com/a/6163129/5848200 – Grinnz Jun 12 '19 at 22:09
  • I think the main question I would have is what encoding this serialization is expected to be transmitted in. – Grinnz Jun 12 '19 at 22:10
  • It is an "other"-class control character; it doesn't surprise me that something expecting text data doesn't like it. Other than that, there's nothing else special about it. Its Bidirectional Category is "Boundary Neutral", which is normal for an "other" or format-type control character. – ysth Jun 12 '19 at 22:11
  • @Grinnz - no, they come from opening a file, so `use utf8` is not needed. Good point in general though :) – DVK Jun 12 '19 at 22:13
  • @ysth - if you fill that out, i think it may possibly be a good (or probably THE correct) answer. Is there a way I can find a list of such characters, and a way to match them in Perl? I can then test if other characters of that class fail same way – DVK Jun 12 '19 at 22:14
  • Can you verify the decoded character via your Perl code: `sprintf '%vX', $char` or `ord $char` (for the decimal ordinal), as it may be different after it's been serialized and encoded in whatever way. – Grinnz Jun 12 '19 at 22:15
  • Also, I'd recommend `:encoding(UTF-8)` instead of `:utf8`, as the latter is an internal use layer which can end up creating an invalid string if you feed it garbage. – Grinnz Jun 12 '19 at 22:22
  • @Grinnz - no change from :encoding(UTF-8) but backwards decoding produces decidedly different character from 2 unicode characters when printing – DVK Jun 12 '19 at 22:29
  • I don't know what you mean by "backwards decoding". You can't print Unicode characters; they must always be encoded to something for serialization, and the bytes will be different if they are not ASCII characters. That's why you should verify what the decoded character is. – Grinnz Jun 12 '19 at 22:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/194846/discussion-between-dvk-and-grinnz). – DVK Jun 13 '19 at 00:01
  • And after all the digging, looks like the issue is an innocuous E2 character and not control 82 :( I'm confused to no end – DVK Jun 13 '19 at 00:06
  • U+00E2, or `â`, is commonly part of the result of double encoding, since the byte `\xE2` starts many UTF-8 sequences. Perhaps the decoder is errantly guessing the data is double encoded and failing when it isn't. – Grinnz Jun 13 '19 at 03:50
  • @Grinnz - seems to be the root cause. It was a Euro 3-byte character – DVK Jun 13 '19 at 18:19
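
Following Grinnz's suggestion above, here is a minimal Perl sketch of how one might verify what character is actually in the decoded data, and why both an E2 and an 82 show up for the Euro sign ($char is a placeholder for whatever character you pull out of your data structure):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Placeholder: the suspect character extracted from the decoded data structure.
my $char = "\x{20AC}";    # the Euro sign, U+20AC, used here as the example

printf "code point: U+%vX\n", $char;   # U+20AC
printf "ordinal:    %d\n", ord $char;  # 8364

# Its UTF-8 encoded form is three bytes: E2 82 AC -- which is why an "E2"
# and an "82" can both appear when looking at the encoded byte stream.
my $bytes = encode('UTF-8', $char);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
```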

1 Answer


You can see information about characters in the Unicode Character Database; a text dump of it can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, where it shows:

0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;

The meanings of the fields can be found at http://www.unicode.org/reports/tr44/#UnicodeData.txt (though that seems to omit the first field, which is the codepoint).

So it is an "other" class control character, with Bidirectional Category "Boundary Neutral" (which is normal for a Cc or Cf class character). There isn't anything else special about it.

But being a control character, it doesn't surprise me that something expecting text data has a problem with it.
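
The same information is also available programmatically in Perl via the core Unicode::UCD module, and characters of this class can be matched with the \p{Cc} property; a minimal sketch:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up U+0082 in the Unicode Character Database (same data as UnicodeData.txt).
my $info = charinfo(0x0082);
print "category: $info->{category}\n";   # Cc (Other, control)
print "bidi:     $info->{bidi}\n";       # BN (Boundary Neutral)

# \p{Cc} matches any control character, so suspect strings can be scanned for them.
my $str = "foo\x{0082}bar";
if ($str =~ /(\p{Cc})/) {
    printf "found control character U+%04X\n", ord $1;
}
```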

ysth
  • And after all the digging, looks like the issue is an innocuous E2 character and not control 82 :( I'm confused to no end – DVK Jun 13 '19 at 00:06
  • @DVK, That's rather incredible. U+00E2 is "â", a perfectly ordinary letter character. The Java program should have no problems with that specific character, and SO wouldn't delete it. On the other hand, the error message says `82`, and it would make sense for SO to delete a control character. I believe this new discovery is a mistake. Combined with the fact that `E2` can be the lead byte of a 3-byte UTF-8 sequence (for a character in U+2000..U+2FFF), it seems like you are looking at the encoded form of a character. – ikegami Jun 13 '19 at 03:52
  • @ikegami - I was too hasty. It wasn't E2 itself - it was a combo of E2 followed by either 82 or A3 (or some other characters, I'm not sure what the pattern is now). E2 by itself as a single character is fine. I'm guessing that may be related to the 3-byte idea you presented – DVK Jun 13 '19 at 17:20
  • @ikegami - it was a bloody Euro character :( It's actually the first example on the Wiki for 3 byte encoding! – DVK Jun 13 '19 at 18:18
  • @DVK, Yeah, like I said, you were reporting the partial encoded form of a Code Point, and not a Code Point itself. € is U+20AC and encodes to E2 82 AC – ikegami Jun 13 '19 at 18:58
  • @DVK, There is an `82` in the encoding of the Euro character, so either you have a double-encoded character, or the Java program is not expecting UTF-8. I suspect the former. In short, the problem is that you don't actually have € because you have an extraneous encode or a missing decode somewhere. – ikegami Jun 13 '19 at 18:59
  • @DVK you need to find out what encoding the Java program is expecting (if in fact it allows non-ASCII at all) – ysth Jun 13 '19 at 19:06
  • @ikegami - when I run decode(UTF-8) on the string, Java stops failing, BUT the resulting string ends up being "?" instead of Euro once it's stored in the database by the Java program. – DVK Jun 13 '19 at 19:20
  • @DVK, The fact that you can decode suggests that you should be decoding. The fact that it still doesn't work doesn't mean that's wrong; it could simply be indicative of another bug or a limitation. But I'm just speculating here. – ikegami Jun 13 '19 at 19:32
  • @DVK maybe encode into ISO-8859-15 (which has €)? But instead of trying stuff, find out for sure what your other program is expecting; it may be that a Euro character is not even possible – ysth Jun 13 '19 at 22:27
  • @ysth - that's my problem :( The other program is extremely difficult to debug. I'll try to in the coming week. – DVK Jun 14 '19 at 00:33
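
To make the double-encoding scenario discussed above concrete, here is a small Perl sketch (the Euro sign is just the example character from this thread; this is not the actual sender/receiver code):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20AC}";               # the Euro sign, U+20AC

# Correct: encode exactly once before putting the data on the wire.
my $once = encode('UTF-8', $euro);   # bytes E2 82 AC

# Double encoding: the already-encoded bytes are treated as characters and
# encoded again -- each non-ASCII byte (E2, 82, AC) becomes two bytes, and the
# leading E2 shows up as "â" when misread as Latin-1/cp1252.
my $twice = encode('UTF-8', $once);  # bytes C3 A2 C2 82 C2 AC

for my $pair ([once => $once], [twice => $twice]) {
    my ($label, $bytes) = @$pair;
    printf "%-5s: %s\n", $label,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```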