How to handle non-ASCII strings properly in C?

Question

My idea was to code a Hangman-like game in C. I want it to be able to use German words with umlauts (e.g. ä, ü, ö) and also Greek words (which consist entirely of non-ASCII characters).

My compiler and my terminal can handle Unicode well. Displaying the strings works well.

But how should I do operations on these strings? For German I could maybe handle the 6 upper- and lowercase accented characters by special-casing them in the functions. But for Greek that seems impossible.

I wrote this test code. It outputs the string, the length of the string (which is of course wrong for the non-ASCII strings, because each of those characters is encoded as a two-byte UTF-8 sequence), and the value of the individual bytes of the string as plain text and in hex.

#include <stdio.h>
#include <string.h>

int main() {
    printf("123456789\n");
    char aTestString[] = "cheese";
    printf("%s has %zu characters\n", aTestString, strlen(aTestString));

    for (int i = 0; i < strlen(aTestString); i++) {
        printf("( %c )", aTestString[i]);   // char as a character
        printf("[ %02X ]", aTestString[i]); // char in hexadecimal
    }

    printf("\n123456789\n");
    char aTestString2[] = "Käse";
    printf("%s has %zu characters\n", aTestString2, strlen(aTestString2));

    for (int i = 0; i < strlen(aTestString2); i++) {
        printf("( %c )", aTestString2[i]);  // char as a character
        printf("[ %02X ]", aTestString2[i]); // char in hexadecimal
    }
    
    printf("\n123456789\n");    
    char aTestString3[] = "λόγος";
    printf("%s has %zu characters\n", aTestString3, strlen(aTestString3));

    for (int i = 0; i < strlen(aTestString3); i++) {
        printf("( %c )", aTestString3[i]);  // char as a character
        printf("[ %02X ]", aTestString3[i]); // char in hexadecimal
    }
}

For example, what is the recommended way to count the Unicode characters, or to see whether a specific Unicode character (that is, code point) is in the string? I am quite sure there must be some simple solution, because such characters are often used in passwords, for example.

Here is the output of the test program:

123456789
cheese has 6 characters
( c )[ 63 ]( h )[ 68 ]( e )[ 65 ]( e )[ 65 ]( s )[ 73 ]( e )[ 65 ]
123456789
Käse has 5 characters
( K )[ 4B ](  )[ FFFFFFC3 ](  )[ FFFFFFA4 ]( s )[ 73 ]( e )[ 65 ]
123456789
λόγος has 10 characters
(  )[ FFFFFFCE ](  )[ FFFFFFBB ](  )[ FFFFFFCF ](  )[ FFFFFF8C ](  )[ FFFFFFCE ](  )[ FFFFFFB3 ](  )[ FFFFFFCE ](  )[ FFFFFFBF ](  )[ FFFFFFCF ](  )[ FFFFFF82 ]
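
A note on the FFFFFF prefixes in this dump: they are most likely sign extension rather than part of the data. Where plain char is signed, bytes from 0x80 upward are negative, and they are promoted to int before %X formats them. A minimal sketch of a fix, assuming that is the cause, converts through unsigned char first (byte_to_hex is a hypothetical helper name, not something from the question):

```c
#include <assert.h>
#include <stdio.h>

/* Writes the two-digit uppercase hex value of one byte into out.
   out must have room for 3 chars (two digits plus the terminator).
   Going through unsigned char keeps bytes >= 0x80 from being
   sign-extended when they are promoted for printf-style formatting. */
void byte_to_hex(char c, char out[3])
{
    assert(out != NULL);
    snprintf(out, 3, "%02X", (unsigned)(unsigned char)c);
}
```

Applied to the second byte of "Käse", this should produce C3 instead of FFFFFFC3. The same cast can be used directly in the printf calls of the test program.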

To get the number of code-points in a Unicode string you need a third-party library. Like [the ICU library](https://icu.unicode.org). — Some programmer dude, Jul 02 '23 at 17:49
Your code would be easier to understand if you translated the output to English. — Andreas Wenzel, Jul 02 '23 at 18:07
I am sorry for that. "cheese ist 6 Zeichen lang" just means "cheese has 6 characters". I'll fix this in my code above. — ᛉᛉᛉ ᛉᛉᛉ, Jul 02 '23 at 18:10
@Someprogrammerdude No you don't. It is a couple of lines of plain C code. — n. m. could be an AI, Jul 02 '23 at 18:38
So you suggest that I do the input as usual with scanf and fgets, but write some functions for searching for characters and calculating the length based on a byte-by-byte view of the string? — ᛉᛉᛉ ᛉᛉᛉ, Jul 02 '23 at 18:44
@ᛉᛉᛉᛉᛉᛉ No. Use `wchar_t` and functions that work with wide strings, and just count `wchar_t`s. You don't need anything more complicated than that until you start handling exotic scripts and rare special characters. For German and Greek it's plenty enough. — n. m. could be an AI, Jul 02 '23 at 18:52
Would you want the uppercase of *tschüß* to be *TSCHUESS* or *TSCHÜSS* or *TSCHÜẞ*? :) More seriously, you'll have to cope with normalization issues as well, since a character like *ü* might be a single codepoint, `LATIN SMALL LETTER U WITH DIAERESIS`, but it might also be two codepoints, `LATIN SMALL LETTER U` followed by `COMBINING DIAERESIS`. Those are respectively the NFC and NFD forms of two canonically equivalent sequences. — tchrist, Jul 02 '23 at 20:47
Related C++ question: [How to uppercase/lowercase UTF-8 characters in C++?](https://stackoverflow.com/q/36897781/12149471) — Andreas Wenzel, Jul 02 '23 at 21:18
@tchrist "you'll have to cope" Probably not. Normal users with German keyboards won't type decomposed anything. It is theoretically possible for them to type the combining diaeresis, but who cares. Just tell them not to do that. — n. m. could be an AI, Jul 03 '23 at 07:42
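
The "couple of lines of plain C" mentioned in the comments could look roughly like the sketch below. It assumes the string is valid UTF-8 and counts every byte that is not a continuation byte (bit pattern 10xxxxxx); utf8_count is a made-up name for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the code points in a UTF-8 string.
   Every code point starts with exactly one byte that is NOT of the
   form 10xxxxxx, so counting those bytes counts the code points.
   Assumption: s is valid UTF-8; malformed input is not detected. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With the test strings from the question this gives 6 for "cheese", 4 for "Käse" and 5 for "λόγος". It counts code points, not user-perceived characters, so a decomposed ü (u followed by COMBINING DIAERESIS) counts as 2, which is the normalization issue raised in the comments.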

score 5 · Answer 1 · answered Jul 02 '23 at 19:05

C's multi-byte string utilities are useful in this case. Using mbrlen, for example, one way to find the number of characters in a string (albeit probably a very naive one that I just bodged together right now) is this:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

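// Counts the characters in s according to the current locale's encoding.
// Naive: assumes s is a valid multibyte string (errors are not handled).
size_t string_size(const char *s)
{
    mbstate_t state = {0};
    size_t len = 0;
    for (; *s != '\0'; ++len)
    {
        unsigned c_len;
        // Feed one byte at a time; (size_t)-2 means "character not complete yet".
        for (c_len = 1; mbrlen(s+c_len-1, 1, &state) == (size_t)-2; ++c_len) {}
        s += c_len; // skip over the whole character
    }
    return len;
}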

int main(void)
{
    setlocale(LC_ALL, "en_US.utf8");
    const char *s = "zß水";
    printf("%zu\n", string_size(s));
}

// Output: 3

Using the same function mbrlen, you could also extract individual characters through finding their lengths. There are also functions to convert between multibyte characters and wide characters if you want to work with that.
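
As a rough illustration of that last point, the following sketch decodes a multibyte string one character at a time with mbrtowc and checks whether it contains a given wide character, which is the core operation of a Hangman guess. The name mb_contains and its return convention are assumptions made for this example. It also assumes that setlocale has already selected a UTF-8 locale and that wchar_t holds Unicode code points (true with glibc, not guaranteed everywhere).

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if the multibyte string word contains letter, 0 if it does
   not, and -1 if word is not valid in the current locale's encoding. */
int mb_contains(const char *word, wchar_t letter)
{
    assert(word != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t left = strlen(word);        /* bytes not yet decoded */

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, word, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or truncated sequence */
        if (n == 0)
            break;                     /* decoded a terminator; stop */
        if (wc == letter)
            return 1;
        word += n;
        left -= n;
    }
    return 0;
}
```

After a call such as setlocale(LC_ALL, "en_US.utf8") succeeds, mb_contains("Käse", L'ä') should return 1 and mb_contains("Käse", L'ü') should return 0, provided the source file is saved as UTF-8.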

Daniel · Answer 2 · 2023-07-02T17:57:03.070

2

To handle Unicode characters, you can use wide characters (wchar_t) and the wide string functions instead of regular characters (char) and the byte string functions. The wide character functions have a w or wcs in their names (e.g., wprintf, wcslen, wcsstr) and operate on whole wchar_t elements rather than on the individual bytes of a multibyte encoding.

An example of printing out a phrase with Unicode characters:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_ALL, ""); // Set the locale to handle wide characters

    wchar_t phrase[] = L"Hello, 世界!"; // Unicode phrase

    wprintf(L"%ls\n", phrase); // Print the Unicode phrase

    return 0;
}
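
Once a word is held as a wide string, counting and searching reduce to wcslen and wcschr. The sketch below is a hypothetical Hangman helper (reveal is not a standard function) that builds the masked word from the letters guessed so far. It assumes every character fits in one wchar_t, which holds for German and Greek but not for all of Unicode on platforms with a 16-bit wchar_t.

```c
#include <assert.h>
#include <wchar.h>

/* Copies word into shown, replacing every character that does not
   occur in guessed with L'_'. Returns how many characters are still
   hidden. shown must have room for wcslen(word) + 1 elements. */
size_t reveal(const wchar_t *word, const wchar_t *guessed, wchar_t *shown)
{
    assert(word != NULL && guessed != NULL && shown != NULL);
    size_t hidden = 0;
    size_t i;
    for (i = 0; word[i] != L'\0'; ++i) {
        if (wcschr(guessed, word[i]) != NULL) {
            shown[i] = word[i];        /* letter was guessed */
        } else {
            shown[i] = L'_';           /* still hidden */
            ++hidden;
        }
    }
    shown[i] = L'\0';
    return hidden;
}
```

For the word L"Käse" and the guesses L"äe", shown becomes L"_ä_e" and the function returns 2. The comparison is case-sensitive; folding case with towlower would be a possible extension, with locale-dependent results.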

edited Jul 02 '23 at 17:57

answered Jul 02 '23 at 17:55

Daniel


The problem with `wchar_t` is that it's not really portable. On e.g. Linux systems `wchar_t` is a 32-bit type, while on Windows using the MSVC compiler it's 16 bits. Windows also uses UTF-16, which is still a variable-length encoding scheme (and will lead to the exact same problems the OP already has). Other systems, and even other *compilers*, might use other sizes or encodings. – Some programmer dude Jul 02 '23 at 18:02
Thought UTF-16 is fixed length encoding with 2 Byte for every char? – ᛉᛉᛉ ᛉᛉᛉ Jul 02 '23 at 18:35
@Someprogrammerdude The OP has specified German. No need to handle surrogates. `wchar_t` is plenty enough for this. – n. m. could be an AI Jul 02 '23 at 18:42
@ᛉᛉᛉ ᛉᛉᛉ "Thought UTF-16 is fixed length encoding with 2 Byte for every char?" You were mistaken. Now that you've looked it up you know that it's not true. – n. m. could be an AI Jul 02 '23 at 18:45
@n.m.willseey'allonReddit the answerer has no clue; this is just copy-pasted generated poo. If they knew what they were doing, they wouldn't have blindly copy-pasted something into an AI. – Adriaan Jul 03 '23 at 06:26
@n.m.willseey'allonReddit Hey, yes I did use ChatGPT. I tested the code and made sure it worked. This was my second answer on this site and I didn't know it was not allowed. Now I do and will not use it. None of my other (one) answers used it, and it is entirely my fault. – Daniel Jul 03 '23 at 23:58
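
To make the point about UTF-16 concrete: code points up to U+FFFF (which covers German and Greek) occupy one 16-bit unit, while anything above that needs a surrogate pair. A small sketch of the encoding rule, with utf16_encode as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is not a valid code point (too large, or a surrogate). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                       /* single unit */
        return 1;
    }
    cp -= 0x10000;                                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));        /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));      /* low surrogate */
    return 2;
}
```

U+00E4 (ä) comes out as the single unit 0x00E4, whereas U+1F34C (an emoji outside the Basic Multilingual Plane) comes out as the pair 0xD83C 0xDF4C. With a 16-bit wchar_t, wcslen would count that emoji as 2.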
