
I have an array of uint32_t values. Each one represents a Unicode character (a code point). I want to print the array like a string, but I'm not able to get that working.

I have tried a lot of different things:

typedef struct String {
    uint32_t *characters;
    unsigned long length;
} WRString;

char* WRStringToString(WRString *wstr){
    char *string = malloc(sizeof(char) * wstr->length * 4);
    int i = 0;
    int j = 0;
    for (; i < wstr->length; i++) {
        string[j++] = wstr->characters[i];

        char byte2 = (char)wstr->characters[i] >> 8;
        if (byte2) {
            string[j++] = byte2;

            char byte3 = (char)wstr->characters[i] >> 16;
            if (byte3) {
                string[j++] = byte3;

                char byte4 = (char)wstr->characters[i] >> 24;
                if (byte4) {
                    string[j++] = byte4;
                }
            }
        }
    }
    return string;
}

Always with the same input:

WRString *string; //Characters are 0xD6, 0x73, 0x74, 0x65, 0x72, 0x72, 0x65, 0x69, 0x63, 0x68     

I tried:

setlocale(LC_CTYPE,"de_DE.UTF-8");
puts(WRStringToString(string));

Gives \326\377\377\377sterreich.

wprintf(L"%s",WRStringToString(string));

Gives the same as long as no locale is set.

The questions "Printing UTF-8 strings with printf - wide vs. multibyte string literals" and "Printing Unicode Character (stored in variables) in C" do not really help me.

Any suggestions?

idmean
  • Those aren't UTF-8 characters in the string, or you could just print them directly. They're Unicode codepoints. Please keep your terminology straight. – Mark Ransom Jan 07 '15 at 16:37
  • @MarkRansom, no, he seems to have just the utf8 bytes encoded in his `uint32_t` – Jens Gustedt Jan 07 '15 at 16:39
  • There are so many terminology problems in the question that it's unclear what you're asking. I can't tell if it's just a language problem or if there's a misunderstanding with regard to Unicode concepts like UTF-8, characters, etc. – Adrian McCarthy Jan 07 '15 at 16:40
  • @JensGustedt No, the first character in his example is 0xD6, which is the codepoint for `Ö`. I doubt it's a legitimate UTF-8 sequence. – Mark Ransom Jan 07 '15 at 16:41
  • @MarkRansom, ok, yes, so then (s)he is really confused. – Jens Gustedt Jan 07 '15 at 16:42
  • @JensGustedt Am I confused? Please could you tell me what I mixed? – idmean Jan 07 '15 at 16:43
  • @idmean, you are taking unicode code points for utf8 encoding. this is not at all the same thing. – Jens Gustedt Jan 07 '15 at 16:45
  • @JensGustedt Oh yes, I see. UTF8 are only Unicode code points that are encoded into up to 4 bytes, aren't they? – idmean Jan 07 '15 at 16:46
  • @idmean, sort of, there is really a non-trivial encoding procedure, which you probably don't want to code yourself. Just use the C library as of my answer. – Jens Gustedt Jan 07 '15 at 16:48
  • Note operator precedence: cast beats shift. `(char)wstr->characters[i] >> 8` --> `((char)wstr->characters[i]) >> 8`. Certainly OP wants `(char) (wstr->characters[i] >> 8)`. There are other issues too. – chux - Reinstate Monica Jan 07 '15 at 16:53
  • @chux Thanks for pointing that out. What other issues do you mean? – idmean Jan 07 '15 at 16:57
  • Other issues: "0xD6, 0x73, 0x74, 0x65,..." is certainly not UTF8, yet locale is `"de_DE.UTF-8"`. Minor: why use `unsigned long length` instead of `size_t` or `unsigned`? Should use `wstr->length * sizeof(uint32_t) + 1` instead of `sizeof(char) * wstr->length * 4`. `string` is not null character terminated. – chux - Reinstate Monica Jan 07 '15 at 17:04
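As the comments point out, the array holds Unicode code points rather than UTF-8 bytes, so the conversion has to apply the actual UTF-8 encoding rules and null-terminate the result. A minimal sketch of such an encoder (a hypothetical rewrite, not code from the thread; `WRStringToUTF8` is an invented name, and the function does not validate code points above 0x10FFFF):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical corrected conversion: encode each code point as UTF-8
 * and null-terminate the result. Caller frees the returned buffer. */
char *WRStringToUTF8(const uint32_t *cps, size_t length) {
    /* worst case: 4 bytes per code point, plus the terminator */
    char *out = malloc(length * 4 + 1);
    if (!out) return NULL;
    size_t j = 0;
    for (size_t i = 0; i < length; i++) {
        uint32_t c = cps[i];
        if (c < 0x80) {                       /* 1 byte: ASCII */
            out[j++] = (char)c;
        } else if (c < 0x800) {               /* 2 bytes */
            out[j++] = (char)(0xC0 | (c >> 6));
            out[j++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             /* 3 bytes */
            out[j++] = (char)(0xE0 | (c >> 12));
            out[j++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (char)(0x80 | (c & 0x3F));
        } else {                              /* 4 bytes */
            out[j++] = (char)(0xF0 | (c >> 18));
            out[j++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[j++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return out;
}
```

With the OP's data, the first code point 0xD6 becomes the two bytes 0xC3 0x96, which a UTF-8 terminal renders as `Ö`.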

2 Answers


These just seem to be Unicode code points. Store them in a wchar_t string, one by one, and then print this with

printf("%ls\n", wstring);

You'd have to set the locale right at the start of your program to the default of the system:

setlocale(LC_ALL, "");
Jens Gustedt
  • According to [this](http://en.wikipedia.org/wiki/Wide_character#Programming_specifics) wchar_t can be as small as one byte. So it seems not perfect. – idmean Jan 07 '15 at 16:55
  • I can't see anything running `wchar_t l[11] = {0x1F330, 0xD6, 0x73, 0x74, 0x65, 0x72, 0x72, 0x65, 0x69, 0x63, 0x68}; printf("%ls\n", l);` Did I again mix something? – idmean Jan 07 '15 at 17:06
  • did you set the locale to something utf8? The C (default) locale wouldn't handle these characters. – Jens Gustedt Jan 07 '15 at 17:08
  • I tried with and without `setlocale(LC_CTYPE,"de_DE.UTF-8");` – idmean Jan 07 '15 at 17:09
  • And no, on any reasonable platforms `wchar_t` is at least 16 bit nowadays. Don't reinvent the wheel. – Jens Gustedt Jan 07 '15 at 17:09
  • also you forgot the `0` character at the end of your string. An alternative would just be to use a normal string: `char s[] = "Österreich"` should work out of the box if you are using the correct locale. – Jens Gustedt Jan 07 '15 at 17:11
  • Ah yes, it was just the 0 character. Thanks. I can't use a char array for good reasons. – idmean Jan 07 '15 at 17:13
  • `wchar_t` is 16-bit on some platforms, like Windows, and thus can be used for UTF-16 encoding, whereas it is 32bit on other platforms, and thus can be used for UTF-32 encoding instead. You have to take encoding into account when converting `uint32_t` values (which are presumably holding raw Unicode codepoints) to `wchar_t` strings (which must be UTF encoded). This discrepancy is why modern C++ now has standard `char16_t` (UTF-16) and `char32_t` (UTF-32) types. Best to use an existing Unicode library, if not native C++ functionality. – Remy Lebeau Jan 07 '15 at 22:11
  • @RemyLebeau, I know, but the use of code points outside the 16 bit range is extremely rare. For most everyday coding needs `wchar_t` is completely sufficient. BTW, C11 has the same new 16 bit and 32 character types. Unfortunately there does not come much functionality with it, so they are mainly useless. – Jens Gustedt Jan 07 '15 at 22:36
  • @JensGustedt: Rare, but not impossible. The SMP plane in particular (U+10000 – U+1FFFF) contains some useful codepoints, like musical and mathematical symbols, and emoji symbols (which are gaining popularity in chat/IM systems). Clearly those do not fit in 16 bits without the use of UTF-16 surrogates. – Remy Lebeau Jan 07 '15 at 23:35

Jens Gustedt's answer pointed in the right direction, but I keep using uint32_t because I need to support Unicode's emoji and wchar_t can be too small for those (as Remy Lebeau said above).

This seems to be working perfectly fine:

setlocale(LC_CTYPE,"de_DE.UTF-8");
printf("%ls\n", string->characters);
idmean
  • No, it only seems so. In exactly the case that `wchar_t` is only 16 bit this will explode under your feet. A platform that has 16 bit `wchar_t` simply can't handle Emojis and stuff like that. – Jens Gustedt Jan 09 '15 at 15:50
  • @JensGustedt Yes, I'm aware of that. But I'll stick to `uint32_t` because a lot of the other code used in the project is already using `uint32_t`. Even on a system on which I can't print using the above method, I'll be at least able to do the other comparisons I need to do. (The unicode code points for emojis are *very* important) – idmean Jan 09 '15 at 15:54
  • But then be sure to buildin something that inhibits compilation on a machine with 16 bit `wchar_t`. Just passing the wrong pointer type to `printf` is a time bomb. – Jens Gustedt Jan 09 '15 at 16:02
  • @JensGustedt Thanks for the advice. I planned to do something like this. – idmean Jan 09 '15 at 16:05
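The compile-time guard Jens Gustedt suggests could be sketched with C11's `_Static_assert` (a hypothetical guard, not code from the thread; 0x10FFFF is the highest Unicode code point, so any wchar_t that can hold it can hold every code point):

```c
#include <wchar.h>

/* Refuse to build where wchar_t cannot hold every Unicode code point:
 * on such platforms, passing uint32_t data to printf("%ls", ...) would
 * hand the wrong pointer type to printf. */
_Static_assert(WCHAR_MAX >= 0x10FFFF,
               "wchar_t too small to hold all Unicode code points");
```

On typical Linux systems wchar_t is 32-bit and this compiles cleanly; on Windows, where wchar_t is 16-bit, it fails at compile time instead of misbehaving at run time.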