The C stdio character encoding

Question

For my pet project I am experimenting with string representations, but I arrived at some puzzling results. First, here is a short application:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
void write_to_file(FILE* fp, const char* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(char), fp);
}
int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const char* ABCDE = "ABCDE";
    write_to_file(fp, ABCDE, strlen(ABCDE) );
    const char* nor = "BBøæåBB";
    write_to_file(fp, nor, strlen(nor));
    const char* hun = "AAőűéáöüúBB";
    write_to_file(fp, hun, strlen(hun));
    const char* per = "CCبﺙگCC";
    write_to_file(fp, per, strlen(per));
    fclose(fp);
}

It does nothing special: it takes a string and writes its length and the string itself to a file. Viewed as hex, the file looks like this:

hex dump of standard char* output

I am happy with the first result: 5 (the first 8 bytes; I'm on a 64-bit machine), as expected. However, I expected the nor variable to have 7 characters (since that is what I see there), but the C library thinks it has 0x0A (i.e. 10) characters (second row, starting with 0A and 8 more characters). And the string itself contains two bytes for some characters (the ø is encoded as C3 B8, and so on).

The same is true for the hun and per variables.

I did the same experiment with Unicode, the following is the application:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>   /* declares wcslen */
void write_to_file(FILE* fp, const wchar_t* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(wchar_t), fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const wchar_t* ABCDE = L"ABCDE";
    write_to_file(fp, ABCDE, wcslen(ABCDE) );
    const wchar_t* nor = L"BBøæåBB";
    write_to_file(fp, nor, wcslen(nor));
    const wchar_t* hun = L"AAőűéáöüúBB";
    write_to_file(fp, hun, wcslen(hun));
    const wchar_t* per = L"CCبﺙگCC";
    write_to_file(fp, per, wcslen(per));
    fclose(fp);
}

The results here are the expected ones: 5 for the length of ABCDE, 7 for the length of BBøæåBB, and so on, with 4 bytes per character.

hex dump of wchar_t* output

So here comes the question: what is the encoding used by the standard C library, and how far can I trust it when developing portable applications (i.e. will what I write out on one platform be read back correctly on another)? And what other recommendations follow from what was presented above?

You're looking at the question the wrong way. Look at the string `nor` in the debugger before you print it. You'll see that it is 10 characters long. That's because your source file was saved in UTF-8 encoding. — Raymond Chen, Dec 20 '13 at 09:18
related: [How does file encoding affect C++11 string literals?](http://stackoverflow.com/q/6794590/4279) — jfs, Dec 20 '13 at 09:23
More remarkable is that this hex viewer *also* interprets UTF8 codes, rather than showing actual raw bytes. (Compare with the Unicode dump.) — Jongware, Dec 20 '13 at 09:26
@Jongware Good point. I'm curious as to just what his environment is. I'm not aware of any environments which support Unicode this well. — James Kanze, Dec 20 '13 at 09:32
@JamesKanze the environment is a Xubuntu 12.10, the viewer is mcview, the editor I used to write the text is vi. Please ask if you're interested in more settings :) Terminal 0.4.8 - Xfce terminal emulator, LXDE desktop environment, not Xfce (yeah, it's a pretty big mess :D ) — Ferenc Deak, Dec 20 '13 at 09:33

score 5 · Answer 1 · answered Dec 20 '13 at 09:18

As far as I know, the standard C library does no encoding at all. I suppose your source file in the first case is saved as UTF-8, so your string constants end up as UTF-8 byte sequences in the compiled code. That is why strlen reports a length of 10 chars (bytes) for nor.

fwrite takes an (untyped) byte array as argument. Since it does not know anything about the bytes processed, it cannot do any encoding-conversion at all here.

Regarding portability, you should be more careful about things like the size of size_t. fwrite(&len, sizeof(size_t), 1, fp) can yield different results on different platforms (8 bytes on your 64-bit machine, typically 4 on a 32-bit one), which may cause your file to be read incorrectly. Also, you have to be careful with the platform's endianness, both for the length field and for any encoding whose code units are wider than one byte (such as UTF-16 or UTF-32).

For anything else, you can be sure that your standard library will put the bytes to disk exactly as you pass them (given a stream opened in binary mode, as here), but when processing them as text, you have to make sure that you use the same encoding on all platforms.
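One way to sidestep both the size_t and the endianness issue is to give the length prefix a fixed width and a fixed byte order. The following is only a sketch under assumptions: the helper names encode_u32_le, decode_u32_le and write_record are hypothetical, and the choice of a 32-bit little-endian prefix is arbitrary, not something the question's file format prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: store value as 4 bytes, least significant byte first,
   regardless of the host's own byte order. */
void encode_u32_le(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(value & 0xFFu);
    out[1] = (unsigned char)((value >> 8) & 0xFFu);
    out[2] = (unsigned char)((value >> 16) & 0xFFu);
    out[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Hypothetical helper: inverse of encode_u32_le. */
uint32_t decode_u32_le(const unsigned char in[4])
{
    assert(in != NULL);
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}

/* Hypothetical replacement for write_to_file: a 4-byte little-endian
   length followed by the raw bytes. Returns 0 on success, -1 on failure. */
int write_record(FILE *fp, const char *bytes, size_t len)
{
    unsigned char prefix[4];

    assert(fp != NULL && bytes != NULL);
    if (len > UINT32_MAX)
        return -1;                      /* does not fit the 32-bit prefix */
    encode_u32_le((uint32_t)len, prefix);
    if (fwrite(prefix, 1, sizeof prefix, fp) != sizeof prefix)
        return -1;
    return fwrite(bytes, 1, len, fp) == len ? 0 : -1;
}
```

A reader on any platform would then read 4 bytes, pass them to decode_u32_le, and read that many bytes of payload. The payload itself is still just bytes; which encoding they are in remains a convention the file format has to state.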

answered Dec 20 '13 at 09:18

user1781290


The standard C library *does* have to do encoding on platforms like Windows that operate in UTF-16 internally. – dan04 Dec 20 '13 at 16:08
@dan04 Ok, but this relates to things like Console-I/O, I guess? – user1781290 Dec 20 '13 at 16:19
Also to things like `fopen`. – dan04 Dec 20 '13 at 16:27
@dan04 I did not know this before. Thank you for the information, but is this a MS extension of the standard? – user1781290 Dec 20 '13 at 16:33
It's not an extension. It's how functions from the standard library have to be *implemented* when C's execution character set is different from the encoding that the OS uses to store filenames. MSVC++ does provide extensions like `_wfopen` that allow the developer to pass UTF-16 strings directly. It's *necessary* to use these functions on Windows because otherwise you can't straightforwardly open a file whose name contains characters outside the ANSI code page. – dan04 Dec 20 '13 at 17:01
@dan04 Are you actually talking about the encoding of the filename? I meant the `ccs` extension, found in MSDN reference to fopen. Filenames have to be converted, of course, no question – user1781290 Dec 20 '13 at 17:09
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/43636/discussion-between-dan04-and-user1781290) – dan04 Dec 20 '13 at 23:31

score 3 · Answer 2 · answered Dec 20 '13 at 09:30

There is no real answer to your question. Practically everything involving encoding is implementation dependent, and often locale dependent as well. Judging from appearances, your narrow character encoding is Unicode UTF-8, and your wide character encoding is Unicode UTF-32LE. This is far from universal, however; even today, I suspect that the most widespread narrow character encoding is ISO 8859-1, and there are still machines which use EBCDIC. For wide character encodings, both UTF-16 and UTF-32 are widespread, and some machines still use older encodings as well. (If you use C++ style IO, you can embed a specific encoding in the stream itself.)

As for your code, fwrite doesn't know (or care) that it is dealing with characters. It just copies an image of memory out to disk (which makes it pretty useless, except for sequences of pre-formatted bytes, since such images generally can't be reliably read back in).

As for strlen: it doesn't know about multibyte characters; it returns the number of bytes until the first 0 byte, not the number of characters. The number of bytes is likely to be greater than the number of characters for any multibyte encoding format. But the issue is even more complex. Independently of the encoding format, there are cases where a sequence of more than one code point will result in a single character; e.g. "\u0063\u0302" will represent a single character, although functions like strlen or wcslen (assuming a wide character string literal) will report more.
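The difference between bytes and code points can be made visible with a small counting function. This is a minimal sketch that assumes the input is valid UTF-8; the name utf8_code_points is hypothetical, and the string literals use explicit byte escapes so the result does not depend on how the source file is saved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a NUL-terminated string that is
   assumed to be valid UTF-8. Continuation bytes have the form 10xxxxxx; every
   other byte starts a new code point. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0u) != 0x80u)
            ++count;
    }
    return count;
}
```

For the UTF-8 bytes of BBøæåBB, written as "BB\xC3\xB8\xC3\xA6\xC3\xA5" "BB", strlen gives 10 while utf8_code_points gives 7. For "c\xCC\x82" (U+0063 followed by the combining U+0302), strlen gives 3 and utf8_code_points gives 2, even though a reader sees one character; counting user-perceived characters needs grapheme segmentation, which this sketch does not attempt.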

score 0 · Answer 3 · answered Dec 20 '13 at 10:05

The standard C library does not encode anything.

If you need portability, it is better to handle the encoding explicitly. libiconv and ICU both work well. You only need to convert the data to a fixed encoding, for example UTF-8, and then save the string to disk using fwrite().

You should also use char, not wchar_t, because wchar_t is wider than one byte on common platforms (typically 16 or 32 bits), which may lead to endianness problems on a different platform.

As for strlen(), it is designed for narrow (char) strings and counts bytes. To determine the length of a wchar_t string, use wcslen() (if available) instead. Otherwise, it is better to convert the strings explicitly.

score 0 · Answer 4 · answered Dec 20 '13 at 12:39

As our colleagues pointed out, fwrite does not know about the encoding.

First, take a serious look at this article; it has a great overview of encodings:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

If you don't want to use any external libs, you will have to deal with your strings in a low-level manner.

For instance, if you are sure about using wchar_t (e.g., expecting UTF-16 encoding), one approach is to scale the len passed to write_to_file by the platform's size of wchar_t, so that fwrite writes the correct number of bytes. This assumes a write_to_file that treats len as a byte count, as the char* version does.

Like this:

write_to_file(fp, ABCDE, sizeof(wchar_t)*wcslen(ABCDE) );

You have 5 wchar_t's, but on Windows/MinGW each of them is 2 bytes long.

Remember to consider the BOM (Byte Order Mark) when dealing with UTF-16. It can be valuable to get bytes in the right order.
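As an illustration of how a reader could use the BOM, the sketch below inspects the first two bytes of a buffer. The enum and the function name utf16_detect_bom are hypothetical. It relies only on the fact that U+FEFF is serialized as FF FE in little-endian UTF-16 and as FE FF in big-endian UTF-16.

```c
#include <assert.h>
#include <stddef.h>

enum utf16_order { UTF16_NO_BOM, UTF16_LITTLE_ENDIAN, UTF16_BIG_ENDIAN };

/* Hypothetical helper: reports the byte order announced by a leading BOM,
   or UTF16_NO_BOM when the buffer does not start with one.
   Limitation: FF FE 00 00 is also the UTF-32LE BOM, so a caller that may
   receive UTF-32 has to check for that case first. */
enum utf16_order utf16_detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UTF16_LITTLE_ENDIAN;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UTF16_BIG_ENDIAN;
    }
    return UTF16_NO_BOM;
}
```

When no BOM is present, the byte order has to come from somewhere else, for example from the file format's own specification.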

Encodings like UTF-8 take a more complex approach if you want to deal with them by hand (take a look at Wikipedia), and using an off-the-shelf lib may be a good idea. I don't have extensive experience with UTF-8 in C++, so I'll let the colleagues recommend a good lib!

Finally, take a look at the new string types that arrived in C++11:
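To give an idea of what the low-level route involves, here is a minimal sketch of encoding a single Unicode code point as UTF-8 without any library. The function name utf8_encode is hypothetical, and the sketch covers only encoding (not decoding or validation of existing data), so it is not a substitute for a real library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 form of code point cp into out, which
   must have room for 4 bytes. Returns the number of bytes written, or 0 if cp
   is not a valid Unicode scalar value (a surrogate or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80u) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800u && cp <= 0xDFFFu)
            return 0;                       /* UTF-16 surrogates are not encodable */
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0u | (cp >> 18));
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;
}
```

For U+00F8 (ø) this produces the two bytes C3 B8, which matches what the question's first hex dump shows for that character.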

u32string and u16string

That can be helpful to guarantee character size.

(And don't forget the old wstring, but as usual, wchar_t is platform dependent.)
