18

I'm having trouble understanding the semantics of u8 literals, or rather, understanding the result on g++ 4.8.1.

This is my expectation:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);

This is the result on g++ 4.8.1

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
  • The source file is ISO-8859(-1)
  • We use these compiler flags: -m64 -std=c++11 -pthread -O3 -fpic

In my world, regardless of the encoding of the source file, the resulting UTF-8 string should be longer than 3 bytes.

Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.

Update

If I explicitly tell the compiler what encoding the source file is in, as many suggested, I get the expected behavior for u8 literals. But regular literals also get encoded to UTF-8.

That is:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
  • compiler command: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
  • Tried a few other charset names defined by iconv, e.g. ISO_8859-1, and so on...

I'm even more confused now than before...

Andrew Brēza
  • 7,705
  • 3
  • 34
  • 40
Fredrik
  • 335
  • 2
  • 9
  • 4
    "The source file is ISO-8859(-1)" and gcc is supposed to know that... how? Use `-finput-charset=...` or use utf8 source files – n. m. could be an AI May 05 '14 at 12:07
  • You just answered your own question: `The source file is ISO-8859(-1)`. Reencode the source as `UTF-8` (or use `u8"\u00E5\u00E4\u00f6"` in your source code) and things should work ok... ([see what coliru has to say](http://coliru.stacked-crooked.com/a/40d42d74c84c48b4)). `clang++` even throws an error if I pass the iso-8859-1 encoded file to it: `main.cc:5:25: error: illegal character encoding in string literal std::string s1 { u8"" };` – Massa May 05 '14 at 12:14
  • 1
    Actually gcc is supposed to look at your locale to determine the encoding but I hear there were bugs in this area, so it's better to specify the input charset explicitly anyway. – n. m. could be an AI May 05 '14 at 12:15
  • [Live demo](http://ideone.com/gq7HOR). – n. m. could be an AI May 05 '14 at 12:21
  • @n.m, well the OS knows it, so asking the OS what charset the file is would be my guess. A bit risky to assume that the current locale corresponds to all source files. In our case that's pretty much true though. – Fredrik May 05 '14 at 12:55
  • I'm not aware of any OS that stores encoding as a file attribute. If yours does, you may ask gcc maintainers to support it. – n. m. could be an AI May 05 '14 at 13:01
  • @Massa, that implies that clang++ doesn't do any conversions between encodings? And the behavior differs from g++; who has interpreted the standard correctly? – Fredrik May 05 '14 at 13:02
  • Using the utf-8 characters inside the source code is, IIRC, _implementation-defined behaviour_... well-formed but unportable. checking... Section 2.2 of the standard says: – Massa May 05 '14 at 13:30
  • `1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.` – Massa May 05 '14 at 13:30
  • 1
    @Fredrik I had similar doubts about the text literals, how they are interpreted/stored, and their relation to the source file encoding, but [my question was about Raw string literals](http://stackoverflow.com/questions/21460700/raw-string-literals-and-file-codification). – PaperBirdMaster May 05 '14 at 13:30
  • `(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)` – Massa May 05 '14 at 13:31
  • "But, regular literals also gets encoded to utf8" Why is that wrong/unexpected? The compiler is free to use utf8 for regular literals too. It is just not *required* to. – jalf May 05 '14 at 23:41
  • @jalf. If the source file encoding is ISO-8859-1, why should I expect any other encoding in string literals? How is the compiler free to use utf8 for regular literals? That would break a whole lot of code if that was the case... In our case, we expect the ISO-8859-1 charset in databases, and all over the place, in our 13MLOC code base. And also, if this is the case, then what's the point of different string literals? – Fredrik May 06 '14 at 07:46
  • @Fredrik: why would you expect the compiler to change which encoding it uses when *outputting* strings, just to conform to the encoding you used in the strings you gave it? What if you gave it multiple encodings? Say, one .cc file encoded as UTF-8, and one encoded as ISO-8859? Why shouldn't the compiler just use one encoding consistently? – jalf May 06 '14 at 14:17
  • The reason for `u8` string literals was *precisely* to give you a guarantee about those string literals. `"foo"` can be represented in whichever encoding the compiler happens to use. But `u8"foo"` is guaranteed to generate a UTF-8 string in the compiled code. It's just how the language is defined. But unfortunately, your sloppy code is not the compiler's problem. ;) The C++ language has *never* given you any guarantee about the encoding of plain string literals. – jalf May 06 '14 at 14:18
  • @jalf, I understand your explanation, but I don't concur :) I would certainly want the compiler to use the specified/deduced encoding for every TU. Otherwise the conversion is bound to fail. I mean, if the TU encoding affects the result of `u8`, it's pretty paramount. You do see the _code-breaking_ problems though? I'm not saying you're wrong about the standard, I'm just saying that this behavior is just extremely strange IMHO, and I find it hard to believe that this behavior is intentional. – Fredrik May 06 '14 at 16:13
  • 1
    @Fredrik but the encoding of the source input has nothing to do with the encoding the compiler uses for strings it outputs. There is no reason why it should use the same encoding for both. The source encoding *can* vary between TU's, but it would be very unexpected if string literals in the resulting program used different encodings because of that. The TU encoding does not affect the result of a `u8` literal. Lying to your compiler (or not informing it) about what encoding the source text is means that it is impossible for the compiler to correctly convert to UTF-8 or to any other encoding – jalf May 06 '14 at 17:01
  • 1
    If you want text processed correctly by *any* piece of software, and it doesn't matter if it is a compiler, a text editor or anything else, then you must ensure that it knows which encoding the source text uses, and you must tell it which encoding to use for the output. If you give it ISO-8859 text when it *thinks* it is seeing UTF-8 text, then it will generate garbage output no matter which encoding you tell it to convert to. – jalf May 06 '14 at 17:02
  • @jalf, I'm not sure what you mean. I'm not lying or keeping anything from the compiler. The source file is ISO8859-1 and I'm explicitly telling the compiler that the encoding is ISO8859-1. Why on earth would the compiler come to the conclusion to output utf8, even if it's entitled to by the standard (which I find hard to believe)? With that kind of reasoning the compiler could output arbitrary encodings at random for _ordinary_ literals. I'm not trying to be a pain in the butt :) I'm trying to understand. – Fredrik May 06 '14 at 21:36
  • 1
    @Fredrik In the *original* problem, you failed to tell the compiler that the encoding was ISO8859. It's a bit tricky to discuss this because you've expanded the scope and effectively started talking about an entirely different issue. For that separate issue, you are telling it the encoding of the *input* text the compiler sees. That says nothing about which encoding it should use in the *output* it generates. The `u8` prefix specifies the encoding for *some* string literals, but others use an implementation-defined encoding. – jalf May 07 '14 at 10:01
  • The reasoning is very, very simple: the compiler chooses an encoding. This is the encoding it uses. End of story. If you want full control over how your strings are encoded, then the best advice has *always* been to keep them out of your source code, and instead load them from a file. In general, stick to ASCII in your string literals, and use the `\u` escape sequences if you have to encode non-ASCII characters in string literals. For anything else, load strings dynamically from files. – jalf May 07 '14 at 10:02

3 Answers

23

The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.

So you have several factors at play:

  1. which encoding is the source file written in (In your case, apparently ISO-8859). According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xc5, 0xe4, 0xf6)
  2. which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
  3. the encoding that the compiler uses for the generated string in the object file. You specify this to be UTF-8 via the u8 prefix.

Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte (I think each of those chars encodes to 2 bytes in UTF-8) string as a result.

However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.

You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
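For instance, a minimal sketch of the escape-sequence workaround (assuming a pre-C++20 compiler, where a u8 literal is still a plain char array); because the literal itself is pure ASCII, the source encoding can no longer garble it:

#include <cassert>
#include <string>

int main() {
  // Escape sequences sidestep any source-encoding confusion.
  const std::string utf8 = u8"\u00E5\u00E4\u00F6"; // åäö spelled as code points
  assert(utf8.size() == 6); // each of these characters is 2 bytes in UTF-8
}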

Edit:

To clarify a bit, when you specify a string literal with the u8 prefix in your source code, then you are telling the compiler "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process's environment, or perhaps just using a hardcoded default).

If the string in your source text contains the bytes 0xc5, 0xe4, 0xf6, and you tell it that "the source text is encoded as ISO-8859", then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
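A small check of that case, as a sketch (assuming the compiler has been told the source is ISO-8859, e.g. with -finput-charset=iso-8859-1, and again a pre-C++20 compiler):

#include <cassert>
#include <string>

int main() {
  const std::string s = u8"åäö";
  // The \x escapes spell out the expected UTF-8 byte sequence directly.
  const std::string expected = "\xC3\xA5\xC3\xA4\xC3\xB6";
  assert(s == expected && s.size() == 6);
}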

However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):

  • it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
  • alternatively, it might say "ok, I have 3 bytes here, I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.

Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.

But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it, to UTF-8". But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).

The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.

The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) was introduced precisely to allow you to specify which encoding you want the compiler to use for the output. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
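To illustrate the different prefixes, here is a sketch (valid for C++11/14/17; in C++20 the element type of a u8 literal changes to char8_t):

int main() {
  const char*     plain = "åäö";   // execution charset: implementation-defined
  const char*     utf8  = u8"åäö"; // guaranteed UTF-8
  const char16_t* utf16 = u"åäö";  // guaranteed UTF-16
  const char32_t* utf32 = U"åäö";  // guaranteed UTF-32
  (void)plain; (void)utf8; (void)utf16; (void)utf32;
}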

jalf
  • 243,077
  • 51
  • 345
  • 550
  • 1
    _it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8)_... this is where `clang++`, for instance, gives you an error message telling that the bytes are invalid. – Massa May 05 '14 at 12:43
  • @Massa, I haven't got clang++ at work. Do I understand you correctly that if the source file is encoded in, let's say ISO8859-1, and clang++ gets this information, it would convert extended ASCII to the corresponding utf8 representation? Otherwise I don't get the use-case :) – Fredrik May 05 '14 at 14:44
  • @jalf, It worked as expected if I explicitly told the compiler which encoding to use. As Massa said, clang++ reports an error if the characters are not valid utf8, which, to me, is the preferred behavior (given that clang++ is able to do actual conversions). – Fredrik May 05 '14 at 14:49
  • @Fredrik yeah, I agree, that is definitely preferred behavior. And yeah, the only thing Clang does different is that it *warns* you if you feed it garbage UTF-8. Both Clang and G++ can perform the conversion *if they know which conversion to perform*. If you tell the compiler that the source is ISO-8859, and ask it to generate a UTF-8 string, then it will perform the necessary conversion. The problem was that you didn't tell it that the source was ISO-8859 – jalf May 05 '14 at 15:02
  • @jalf, Hmm, I still get unexpected behavior. I get the expected utf8 string with u8 literals (haven't checked them, but they seem ok). But I also get utf8 strings with regular literals. That is: `std::string{ u8"åäö"} == std::string{ "åäö"}`. I'll update the question... – Fredrik May 05 '14 at 15:13
  • @jalf nah, `clang++` **does not have** the `-finput-charset` option; so, AFAICT, it can either assume UTF-8 encoded input or locale-dependent (and, in my case, again UTF-8 encoded) input. – Massa May 05 '14 at 15:25
  • 1
    @Fredrik as I said in a comment under your question, there's nothing unexpected about that. The compiler is under no obligation to use ISO-8859 for regular literals. – jalf May 06 '14 at 14:19
  • This was a while ago, but I didn't resolve it, so I'm doing it now... I'm still not certain that `u8` is semantically sound, but at least I've learned "not to use it". I can understand the arguments for why it works as you guys explained it, but I still don't see the benefit/use-case for its use. IMHO it's somewhat broken, or it might be that I'm somewhat broken... – Fredrik Sep 19 '18 at 20:19
5

In order to illustrate this discussion, here are some examples. Let's consider the code:

#include <iostream>

int main() {
  std::cout << "åäö\n";
}

1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

In other words, two bytes per "grapheme cluster" (according to unicode jargon, i.e. in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, the input-charset is assumed to be utf-8 by cpp, and the exec-charset is utf-8 by default in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.

2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:

% ./a.out | od -txC
0000000 e5 e4 f6 0a

i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here, as this time it seems that cpp correctly guessed that the file was iso-8859-1 (without any hint), converted it to utf-8 internally (according to the link above) but the compiler still stored the iso-8859-1 string in the binary. This we can check by looking at the .rodata section of the binary:

% objdump -s -j .rodata a.out

a.out:     file format elf64-x86-64

Contents of section .rodata:
400870 01000200 00e5e4f6 0a00               ..........

(Note the "e5e4f6" sequence of bytes).
This makes perfect sense, as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in the program's output.

3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec-charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:

% ./a.out | od -txC
0000000 e5 e4 f6 0a

It is not clear to me how these two options interact...

4) Now let's add the "u8" prefix into the mix:

#include <iostream>

int main() {
  std::cout << u8"åäö\n";
}

If the file is utf-8-encoded, then unsurprisingly, compiling with the default charsets (g++ -std=c++11 encoding.cpp) produces utf-8 output as well. If I request the compiler to use iso-8859-1 internally instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1 and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

So it seems that "u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".
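One way to observe this directly is to dump the bytes of both flavors of the literal and compare the output under different -finput-charset/-fexec-charset combinations. A minimal sketch (pre-C++20, where the u8 literal is still a plain char array; dump is just a helper introduced here):

#include <cstdio>

// Print each byte of a literal in hex, so the effect of the charset flags
// on plain vs. u8 literals becomes visible.
static void dump(const char* label, const char* s) {
  std::printf("%s:", label);
  for (; *s != '\0'; ++s)
    std::printf(" %02x", static_cast<unsigned>(static_cast<unsigned char>(*s)));
  std::printf("\n");
}

int main() {
  dump("plain", "åäö");   // follows -fexec-charset
  dump("u8   ", u8"åäö"); // always UTF-8
}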

v.p
  • 51
  • 1
  • 2
2

I found by trial and error that on MSVC, e.g. "ü" and "\u00FC" did not produce the same string. (Of course, ü has the code point U+00FC.)

My take is that for maximally portable code, one should not rely on assumptions the compiler makes or on encodings it has to be told about.

I found two reliable ways to put UTF-8 in string literals:

  1. Use UTF-8 code-units like this: "\xC3\xBC"
  2. Use the u8 prefix in conjunction with \u escape sequences: u8"\u00FC".

In the first one, you tell the compiler what to do, and in the second one what you want.
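For example, a minimal sketch (pre-C++20); both variants should yield the same two UTF-8 code units 0xC3 0xBC for U+00FC:

#include <cassert>
#include <string>

int main() {
  const std::string a = "\xC3\xBC"; // spell out the UTF-8 code units yourself
  const std::string b = u8"\u00FC"; // let the compiler do the conversion
  assert(a == b && a.size() == 2);
}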

Just for the record, neither unprefixed "\u00FC" nor u8"ü" gave me UTF-8 coded strings on all platforms, compilers, and input encodings.

There are at least two good reasons to prefer u8"s\u00FCchtig" (süchtig) over "s\xC3\xBCchtig":

  • You can search for U+00FC in any reasonable character map.
  • \u takes exactly 4 hex digits, and for non-BMP characters, \U with 8 hex digits has you covered; on the other hand, \x consumes as many hex digits as it can, e.g. "s\xC3\xBCchtig" doesn't actually work: it reads \xBCc as one value, meaning you have to split the string into two literals: "s\xC3\xBC""chtig" (see the sketch below).
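A sketch of that pitfall and its workaround (pre-C++20):

#include <cassert>
#include <cstring>

int main() {
  // "s\xC3\xBCchtig" would not mean what you want: the \x escape greedily
  // eats the hex digits in "BCc", so split the literal so that 'c' starts
  // a fresh one (adjacent literals are concatenated).
  const char* with_x = "s\xC3\xBC" "chtig";
  const char* with_u = u8"s\u00FCchtig"; // \u always takes exactly 4 hex digits
  assert(std::strcmp(with_x, with_u) == 0);
}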

I still cannot tell you how to transition to C++20 with this, as u8 literals got their own type: char8_t.
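If it helps, here is a minimal C++20 sketch of what I mean (an illustration, not something I claim is the recommended migration path): the literal is now an array of char8_t, so it no longer converts to std::string directly.

#include <string>

int main() {
  std::u8string s = u8"\u00FC"; // char8_t-based string in C++20
  // To get the raw bytes into a plain std::string, reinterpret the data:
  std::string bytes(reinterpret_cast<const char*>(s.data()), s.size());
  (void)bytes;
}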

Quirin F. Schroll
  • 1,302
  • 1
  • 11
  • 25