For my pet project I am experimenting with string representations, but I arrived at some troubling results. First, here is a short application:

    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    /* Write the length prefix, then the raw bytes of the string. */
    void write_to_file(FILE* fp, const char* c, size_t len)
    {
        fwrite(&len, sizeof(size_t), 1, fp);
        fwrite(c, sizeof(char), len, fp);
    }

    int main(void)
    {
        FILE* fp = fopen("test.cod", "wb+");
        if (fp == NULL)
            return 1;
        const char* ABCDE = "ABCDE";
        write_to_file(fp, ABCDE, strlen(ABCDE));
        const char* nor = "BBøæåBB";
        write_to_file(fp, nor, strlen(nor));
        const char* hun = "AAőűéáöüúBB";
        write_to_file(fp, hun, strlen(hun));
        const char* per = "CCبﺙگCC";
        write_to_file(fp, per, strlen(per));
        fclose(fp);
        return 0;
    }

It does nothing special: it just takes a string and writes its length and the string itself to a file. Now, the file, when viewed as hex, looks like:
I am happy with the first result: the length is 5 (stored in the first 8 bytes, since I'm on a 64-bit machine), as expected. However, I expected the `nor` variable to have 7 characters (since that is what I see there), but the C library thinks it has `0x0A` (i.e. 10) characters (second row, starting with `0A` and 8 more characters). The string itself also contains double-byte characters (the `ø` is encoded as `C3 B8`, and so on).
The same is true for the `hun` and `per` variables.
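
To make the byte count versus character count difference concrete, here is a minimal sketch. It assumes the string bytes are valid UTF-8 (which the `C3 B8` encoding of `ø` suggests); `utf8_codepoints` is a hypothetical helper name, not a standard library function. It counts every byte that is not a UTF-8 continuation byte, whereas `strlen` counts all bytes up to the terminating NUL.

```c
#include <stddef.h>

/* Hypothetical helper: count Unicode code points in a NUL-terminated
   string, assuming it holds valid UTF-8. Continuation bytes have the
   bit pattern 10xxxxxx, so every other byte starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the bytes of `BBøæåBB` written with explicit escapes, `"BB\xC3\xB8\xC3\xA6\xC3\xA5" "BB"`, `strlen` gives 10 while this helper gives 7. The literal is split in two so that the trailing `BB` is not swallowed by the preceding hex escape.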
I did the same experiment with wide characters (`wchar_t`); the following is the application:

    #include <stdio.h>
    #include <stddef.h>
    #include <wchar.h>   /* wchar_t, wcslen */

    /* Write the length prefix, then the raw wchar_t elements. */
    void write_to_file(FILE* fp, const wchar_t* c, size_t len)
    {
        fwrite(&len, sizeof(size_t), 1, fp);
        fwrite(c, sizeof(wchar_t), len, fp);
    }

    int main(void)
    {
        FILE* fp = fopen("test.cod", "wb+");
        if (fp == NULL)
            return 1;
        const wchar_t* ABCDE = L"ABCDE";
        write_to_file(fp, ABCDE, wcslen(ABCDE));
        const wchar_t* nor = L"BBøæåBB";
        write_to_file(fp, nor, wcslen(nor));
        const wchar_t* hun = L"AAőűéáöüúBB";
        write_to_file(fp, hun, wcslen(hun));
        const wchar_t* per = L"CCبﺙگCC";
        write_to_file(fp, per, wcslen(per));
        fclose(fp);
        return 0;
    }

The results here are the expected ones: 5 for the length of `ABCDE`, 7 for the length of `BBøæåBB`, and so on, with 4 bytes per character.
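
As a rough illustration of where the "4 bytes per character" figure comes from, the sketch below computes how many bytes the wide-character program writes for one string. `wide_record_bytes` is a hypothetical name used only for this example. Both `sizeof(size_t)` and `sizeof(wchar_t)` are implementation-defined, so the result depends on the platform and compiler: with the 8-byte `size_t` and 4-byte `wchar_t` seen here, `ABCDE` takes 8 + 5 * 4 = 28 bytes, while a compiler with a 2-byte `wchar_t` (MSVC on Windows, for instance) would produce a different layout.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes written for one string by the
   wide-character program, i.e. the size_t length prefix followed by
   wcslen(s) elements of sizeof(wchar_t) bytes each. */
size_t wide_record_bytes(const wchar_t *s)
{
    return sizeof(size_t) + wcslen(s) * sizeof(wchar_t);
}
```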
So here comes the question: what encoding does the standard C library use, and how reliable is it when developing portable applications (i.e. will what I write out on one platform be read back correctly on another)? And what other recommendations are there, taking into consideration what was presented above?