
Is there any conceivable reason why I would see different results using a Unicode string literal versus the actual hex value for the UChar?

UnicodeString s1(0x0040); // @ sign
UnicodeString s2("\u0040");

s1 isn't equivalent to s2. Why?

Ternary
  • What is `UnicodeString` -- is it defined by ICU? – Kerrek SB Nov 16 '11 at 00:06
  • @KerrekSB UnicodeString is ICU. – moshbear Nov 16 '11 at 00:43
  • @moshbear: Do you have a link to the API reference? This should be straight-forward to sort out. – Kerrek SB Nov 16 '11 at 00:51
  • @KerrekSB http://icu-project.org/apiref/icu4c/classUnicodeString.html – moshbear Nov 16 '11 at 00:58
  • Hm, the literal `"\u0040"` is just not well-defined (that is, it's implementation-defined). So I guess we can't answer that in general. If it were a UTF-8 string (`u8"\u0040"`) we might be in better shape. – Kerrek SB Nov 16 '11 at 01:08
  • @KerrekSB By implementation, you mean by the compiler or by the library in question (ICU in this case)? – Ternary Nov 16 '11 at 01:57
  • @Ternary: I'd guess that something like the compiler's execution character set would play a role. Good compilers let you configure that. In any event, you just shouldn't be using `\u` and `\U` escape sequences in this context. Here's a [previous post](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x) of mine on the subject. – Kerrek SB Nov 16 '11 at 02:03
  • @KerrekSB For conversation's sake, what if the \u value were read from a file at runtime? So if you had \u0040 in a file that was read into a UnicodeString at runtime, how does that change behavior? Because the results are different but I'm not sure why. – Ternary Nov 16 '11 at 02:06
  • @Ternary: That doesn't make sense. `\u` is an *escape sequence*, which is a lexical feature of the C++ grammar used for literal values. You cannot "read it from a file". (De)serialiasation always requires that you document the format. – Kerrek SB Nov 16 '11 at 02:10
  • @KerrekSB Well using the ICU bundling you can have resource files that are key-value pairs in the format of `keyname {"some text \u0040"}` and extract the text for the key into a UnicodeString. – Ternary Nov 16 '11 at 02:56
  • @Ternary: It still doesn't make sense. If you're reading it from a file, it's just data, and perhaps ICU comes with a parser for that. But that's not the same as an escape-sequence literal in the source code. That's like saying if you read a string `"terminate()"` then your program stops... – Kerrek SB Nov 16 '11 at 02:58
  • @KerrekSB But the UnicodeString ctor is just taking a char * (I believe) which is ready character by character out of the file (I assume), or maybe you're right and ICU has a parser for that. I'm saying it *does* work though. Just another data point in an issue that is perplexing me. – Ternary Nov 16 '11 at 03:04
  • @Ternary: There is a crucial difference between `UnicodeString("\u0040")` and `UnicodeString("\\u0040")`! – Kerrek SB Nov 16 '11 at 03:13
  • I understand that. I'm just saying ICU supports "\u0040" as a value in a resource bundle http://userguide.icu-project.org/locale/resources – Ternary Nov 16 '11 at 03:25
  • @KerrekSB Looks like it is done at runtime by the ICU `Since ICU is not a compiler extension, the "unescaping" is done at runtime and the backslash itself must be escaped (duplicated) so that the compiler does not attempt to "unescape" the sequence itself.` From http://userguide.icu-project.org/strings – Ternary Nov 16 '11 at 17:42
  • @Ternary: and the penny drops :-) – Kerrek SB Nov 16 '11 at 17:44
  • @KerrekSB Bingo. I found the reason why in their doc. Thanks so much for your help and time. – Ternary Nov 16 '11 at 18:24
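
To make the distinction from the comments above concrete, here is a minimal sketch (assuming an ASCII-compatible execution character set; the variable names are illustrative). The compiler substitutes `\u0040` in source code at compile time, while ICU's `unescape()` processes the six characters `\u0040` at runtime, which is why resource-bundle text needs only a single backslash but C++ source needs a doubled one:

#include "unicode/unistr.h"

// Compile time: the compiler replaces the escape before ICU ever sees it;
// what lands in the char array depends on the execution character set.
UnicodeString fromCompiler("\u0040");

// Run time: the doubled backslash keeps the six characters \u0040 intact
// through compilation, and unescape() converts them to U+0040.
UnicodeString fromIcu = UnicodeString("\\u0040").unescape();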

4 Answers


The \u escape sequence AFAIK is implementation-defined, so it's hard to say why they are not equivalent without knowing the details of your particular compiler. That said, it's simply not a safe way of doing things.

UnicodeString has a constructor taking a UChar and one for UChar32. I'd be explicit when using them:

UnicodeString s(static_cast<UChar>(0x0040));
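
For code points outside the BMP, the UChar32 overload does the same job (a sketch; the code point is an arbitrary example):

UnicodeString t(static_cast<UChar32>(0x1F600)); // stored as a UTF-16 surrogate pair, so t.length() == 2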

UnicodeString also provides an unescape() method that's fairly handy:

UnicodeString s = UNICODE_STRING_SIMPLE("\\u4ECA\\u65E5\\u306F").unescape(); // 今日は
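
Note the doubled backslashes: they keep the escape sequences intact through compilation so that unescape() can process them at runtime.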
NuSkooler

Couldn't reproduce on ICU 4.8.1.1:

#include <stdio.h>
#include "unicode/unistr.h"

int main(int argc, const char *argv[]) {
  UnicodeString s1(0x0040); // @ sign
  UnicodeString s2("\u0040");
  printf("s1==s2: %s\n", (s1==s2)?"T":"F");
  //  printf("s1.equals s2: %d\n", s1.equals(s2));
  printf("s1.length: %d  s2.length: %d\n", s1.length(), s2.length());
  printf("s1.charAt(0)=U+%04X s2.charAt(0)=U+%04X\n", s1.charAt(0), s2.charAt(0));
  return 0;
}

=>

s1==s2: T
s1.length: 1  s2.length: 1
s1.charAt(0)=U+0040 s2.charAt(0)=U+0040

gcc 4.4.5, RHEL 6.1, x86_64

Steven R. Loomis

For anyone else who finds this, here's what I found in ICU's documentation [1]:

The compiler's and the runtime character set's codepage encodings are not specified by the C/C++ language standards and are usually not a Unicode encoding form. They typically depend on the settings of the individual system, process, or thread. Therefore, it is not possible to instantiate a Unicode character or string variable directly with C/C++ character or string literals. The only safe way is to use numeric values. It is not an issue for User Interface (UI) strings that are translated.

[1] http://userguide.icu-project.org/strings
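
As a minimal sketch of the "numeric values" approach the documentation recommends (the function name is illustrative):

#include "unicode/unistr.h"

// Build the string from code point values so the result does not depend
// on the compiler's or runtime's character set.
UnicodeString makeGreeting() {
  UnicodeString s;
  s.append(static_cast<UChar>(0x0040));   // U+0040 '@'
  s.append(static_cast<UChar32>(0x4ECA)); // U+4ECA, first character of 今日は
  return s;
}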

Ternary

The double quotes in your \u constant are the problem. This evaluated properly:

wchar_t m1( 0x0040 );
wchar_t m2( '\u0040' );
bool equal = ( m1 == m2 );

`equal` was true.

Gnawme
  • I can't find anything in the C++11 standard that backs this up. Do you have a reference? – Kerrek SB Nov 16 '11 at 01:07
  • @KerrekSB: If you're asking about the reserved area, I believe that's specific to ICU (with which I'm only barely familiar). – Gnawme Nov 16 '11 at 01:22
  • I'm going to downvote this. I don't think this applies to the question. Also, the character pseudo-literal `'\u0040'` is no better defined than the string pseudo-literal `"\u0040"`; both are implementation- and context-dependent, and should not be used in that way at all. – Kerrek SB Nov 16 '11 at 01:28
  • @KerrekSB: The C++03 Standard, section 2.2, Character sets: "The _universal-character-name_ construct provides a way to name other characters. [e.g. \u hex-quad or \U hex-quad hex-quad]. The character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN." In what way do you mean 'implementation-dependent'? – Gnawme Nov 16 '11 at 01:36
  • I know what `\uXXXX` means. The problem is, what does `'\uXXXX'` mean? You see, the former is just an abstract value, but the latter is a value of a concrete **type**. And there's no universal rule how an arbitrary Unicode codepoint should turn into a `char` (which is the type of the `''` literal). Contrast this to, say, `U"\uXXXX"`, where the string consists of `char32_t`s and the semantics are standardized (so the string has two elements, the first is the 32-bit integer `0x0000XXXX`, and the second is zero). – Kerrek SB Nov 16 '11 at 01:43
  • In ICU, `'\u0040'` specifies the `'@'` character using a valid universal-character-name _escape sequence_. – Gnawme Nov 16 '11 at 07:16
  • '\u0040' will be the character value that represents U+0040 in whatever encoding is used for char if that encoding can represent U+0040, and if not then the value is implementation defined. So if you want to represent U+0040 in the system's char encoding then '\u0040' is the right way to do it. – bames53 Nov 16 '11 at 18:47
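
For completeness, a C++11 sketch of the well-defined alternative mentioned in these comments: with a `U` prefix the literal's type fixes the encoding as UTF-32, so the comparison no longer depends on the execution character set (requires a C++11 compiler):

#include <cassert>

int main() {
  char32_t m1 = 0x0040;
  char32_t m2 = U'\u0040'; // UTF-32 character literal; its value is exactly 0x0040
  assert(m1 == m2);        // guaranteed, unlike the plain char version above
  return 0;
}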