I couldn't find an explanation in the C standard of how hexadecimal escape sequences in wide strings are processed.

For example:

wchar_t *txt1 = L"\x03A9";
wchar_t *txt2 = L"\xA9\x03";

Are these somehow processed (for example, by prefixing each byte with a \x00 byte), or are they stored in memory exactly as they are written here?

Also, how does the L prefix operate according to the standard?

EDIT:

Let's consider txt2. How would it be stored in memory: as \xA9\x00\x03\x00, or as \xA9\x03, exactly as written? The same goes for \x03A9. Would it be treated as one wide character, or as 2 separate bytes that are each turned into a wide character?

EDIT2:

Standard says:

The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.

Now, we have a char literal:

wchar_t txt = L'\xFE\xFF';

It consists of 2 hex escape sequences, therefore it should be treated as two wide characters. If these are two wide characters, they can't fit into a single wchar_t (yet it compiles in MSVC), and in my case the sequence is treated as the following:

wchar_t foo = L'\xFFFE';

which contains a single hex escape sequence and therefore a single wide character.

EDIT3:

Conclusions: each octal/hex escape sequence is treated as a separate value (wchar_t *txt2 = L"\xA9\x03"; consists of 3 elements). wchar_t txt = L'\xFE\xFF'; is not portable because its value is implementation-defined; one should use wchar_t txt = L'\xFFFE'; instead.

user206334
  • Concerning your edit: Who cares? `txt2` points to the first element of an array of three integers of type `wchar_t` with values 0xA9, 0x03 and 0x00, in that order. The representation of that type depends on your platform (and can be inspected by treating each integer as an array of bytes). – Kerrek SB Apr 07 '13 at 19:24
  • @KerrekSB: The question is tagged C. String literals are read-only (in the sense that modifying them has undefined behavior), but not `const`. `char *s = "hello";` is perfectly legal, but admittedly dangerous; there *should* be a `const`, but the compiler is not obliged to warn about it. – Keith Thompson Apr 07 '13 at 19:25
  • @KeithThompson: Oh, good point - I removed the comment. – Kerrek SB Apr 07 '13 at 19:25
  • You should be aware that the width of `wchar_t` is implementation-defined. It's commonly 16 bits on Windows, 32 bits on Linux and similar systems. – Keith Thompson Apr 07 '13 at 19:26
  • @KerrekSB I care, cause I construct UTF-8 and UTF-16 strings in memory and they should be stored in particular order so that the output would be valid (that's why I care about endianness and how oct/hex sequences are treated or massaged by the compiler). – user206334 Apr 07 '13 at 20:01
  • If all you want is to serialize and deserialize UTF-8 sequences, then that's a lot simpler than you're making it. For UTF-16, it's still easy, and you just need *one* point of contact with the wire format, which can be entirely separate from the text processing logic. – Kerrek SB Apr 07 '13 at 20:38
  • @KerrekSB I wanted to know what compiler does when it encounters hexadecimal escape inside L"" strings. Now I know that each hex sequence inside a string is treated as a separate wide character (this was my major issue) (several separate byte values are not concatenated into multibyte one as I imagined). Also I became aware that L'\xFE\xFF' is not part of standard (the concatenation performed is implementation defined) and I need to use L'\xFFFE' in wide character literals to get portable results. – user206334 Apr 07 '13 at 20:50

1 Answer

There's no processing. L"\x03A9" is simply an array of two wchar_t elements, 0x3A9 and 0 (its type is wchar_t[2] in C and wchar_t const[2] in C++), and similarly L"\xA9\x03" is an array of three elements: 0xA9, 0x03 and 0.

Note in particular C11 6.4.4.4/7:

Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.

And also C++11 2.14.3/4:

There is no limit to the number of digits in a hexadecimal sequence.

Note also that when you are using a hexadecimal sequence, it is your responsibility to ensure that your data type can hold the value. C11 6.4.4.4/9 actually spells this out as a requirement, whereas in C++ exceeding the type's range is merely "implementation-defined". (And a good compiler should warn you if you exceed the type's range.)


Your code doesn't make sense, though, because the left-hand sides are neither arrays nor pointers. It should be like this:

wchar_t const * p = L"\x03A9";    // pointer to the first element of a string

wchar_t arr1[] = L"\x03A9";       // an actual array
wchar_t arr2[2] = L"\x03A9";      // ditto, but explicitly typed

std::wstring s = L"\x03A9";       // C++ only

On a tangent: This question of mine elaborates a bit on string literals and escape sequences.

Kerrek SB
  • By "processing", I think the OP is asking how individual wide characters are represented in memory. – Jim Balter Apr 07 '13 at 18:43
  • @JimBalter: No different than any other integer. – Kerrek SB Apr 07 '13 at 18:43
  • Um, *I* know the answer. But in regard to "it is your responsibility to ensure that your data type can hold the value" -- the data type for chars in a wide string is `wchar_t`. – Jim Balter Apr 07 '13 at 18:44
  • Actually, in memory these two strings look the same, that's why I asked. \xFEFF would be byte switched, while \xFE\xFF would be not. – user206334 Apr 07 '13 at 18:48
  • @user206334: How did you find that out? There is no "byte switching" in the language. The language is about *values*, not about representations. – Kerrek SB Apr 07 '13 at 18:50
  • Looked at produced binary file cause it bugged me. File was produced using VC++, which might be acting not according to the standard. For example, L"a" would append \x00 byte to make character wide. I wonder in what cases it is done and whether I can rely on hexadecimal notation not to be meddled in any way. – user206334 Apr 07 '13 at 18:53
  • @user206334: I find that hard to believe, and indeed it [works as expected](http://ideone.com/0G3SoO) for me. – Kerrek SB Apr 07 '13 at 18:59
  • And why L"\xA9\x03" and L"\xA903" have different element count? – user206334 Apr 07 '13 at 19:10
  • @user206334: Because "There is no limit to the number of digits in a hexadecimal sequence." `L"\xA9\x03"` contains two hexadecimal sequences. `L"\xA903"` has one. – Keith Thompson Apr 07 '13 at 19:20
  • @KeithThompson So each hexadecimal sequence is treated as a separate character which is later converted to wide (let's say 2 byte) one? – user206334 Apr 07 '13 at 19:25
  • @user206334: each hexadecimal sequence is treated as a single *value*. – Kerrek SB Apr 07 '13 at 19:26
  • @user206334: Each hexadecimal sequence is not necessarily one byte. It specifies a *value* of type `wchar_t`. A wide string literal `L"..."` specifies an array of `wchar_t`. And the width of `wchar_t` is not necessarily "let's say 2" bytes; it's `sizeof (wchar_t)` bytes. – Keith Thompson Apr 07 '13 at 19:28
  • You might want to drop that last C++-specific example. – Keith Thompson Apr 07 '13 at 19:30
  • @KeithThompson wchar_t txt = L'\xFE\xFF'; then should not compile and should produce an error, cause these should be 2 wide characters instead of 1. But it compiles in MSVC and works as a 2-byte wide character. Therefore, I made an assumption that in L"\xFE\xFF" it is also treated as 1 wide char. – user206334 Apr 07 '13 at 19:32
  • Where does standard say that each hexadecimal sequence is treated as a single value and not as a part of larger value? – user206334 Apr 07 '13 at 19:37
  • @user206334: In sections 6.4.4.4 and 6.4.4.5. – Kerrek SB Apr 07 '13 at 19:38
  • @user206334 (in your comment to Keith): that's like saying that `int n = 1;` shouldn't compile because `int` has 32 bits and `1` only has one bit. – Kerrek SB Apr 07 '13 at 19:40
  • @user206334: Incorrect. `0xFE` is a perfectly valid value of type `wchar_t`, just as `1` is a valid value of type `long int`. `0xFE`, `0x00FE`, and `0x000000FE` are the same value; so are `L'\xFE'`, `L'\x00FE'`, and `L'\x000000FE'` – Keith Thompson Apr 07 '13 at 19:41
  • @KeithThompson I made a new edit explaining the issue if each separate hex sequence is treated as a separate wchar. You didn't notice that it had 2 hex sequences inside single quotes '\xFE\xFF'. – user206334 Apr 07 '13 at 19:49
  • Character constants (delimited by `'`) are very different from string literals (delimited by `"`), and I missed the distinction in your previous comment. `L"\xA9\x03"` is a wide string literal specifying an array containing 3 `wchar_t` elements, including the terminating null wide character. `L'\xA9\x03'` is a multi-character constant whose value is implementation-defined. Such things are rarely useful. (You didn't mention character constants in the original question.) – Keith Thompson Apr 07 '13 at 20:04
  • @KeithThompson There can't be multicharacter constants with wchar_t storage type cause in wchar_t fits exactly one wide character, AFAI understand the standard. Multicharacter character constants can only be integer ones like int baz = 'abcd'; In my case it seems that string literal L"\xA9\x03" is treated the same way as char literal containing two wide char elements (\x03A9 and \x0000) and not three. I need to test it with another compiler. – user206334 Apr 07 '13 at 20:10
  • @user206334: You're making too many assumptions. Read what the standard *says*. It says that a wide character constant may contain more than one "multibyte character", and that the value of such a constant is implementation-defined. `L'\xA9\x03'` is perfectly legal; it's a constant of type `wchar_t` with an implementation-defined value. Because its value is implementation-defined, it is neither portable nor likely to be particularly useful. (Wide string literals are an entirely different matter.) – Keith Thompson Apr 07 '13 at 20:22