
I want to print blå using UTF-8, but I do not know how to do it. The UTF-8 encoding for b is 62, for l is 6c, and for å is c3 a5. I am not sure what to do with the å character. Here is my code:

#include <stdio.h>

int main(void) {

    char myChar1 = 0x62;  //b
    char myChar2 = 0x6C;  //l
    char myChar3 = ??     //å

    printf("%c", myChar1);
    printf("%c", myChar2);
    printf("%c", myChar3);

    return 0;
}

I also tried this:

#include <stdio.h>

#define SIZE 100

int main(void) {

    char myWord[SIZE] = "\x62\x6c\xc3\xa5\x00";

    printf("%s", myWord);

    return 0;
}

However, the output was:

bl├Ñ

Finally, I tried this:

#include <stdio.h>
#include <locale.h>

#define SIZE 100

int main(void) {

    setlocale(LC_ALL, ".UTF8");
    char myWord[SIZE] = "\x62\x6c\xc3\xa5\x00";

    printf("%s", myWord);

    return 0;
}

Same output as before.

I am not sure I fully understand Unicode. If I understand it correctly, UTF-16 and UTF-32 use wide characters, where each code unit is the same size (2 or 4 bytes, respectively). UTF-8, on the other hand, uses a variable number of bytes per character (1-4). I know the first 128 characters require 1 byte, and almost all of Latin-1 can be encoded with 2 bytes, etc. Since UTF-8 does not require wide characters, I should not need the wchar functions in my code. Therefore, I do not see why my second and/or third attempt does not work. My only remaining idea would be to use setmode to change the encoding of stdin and stdout, although I am not sure that would work, and I am not sure how to implement it.

Summary:

Why doesn't my code work?

I am on Windows, using VS Code, with MinGW32 as my compiler.

  • Read about the difference between codepoints and codeunits. UTF-32 uses 32-bit codeunits, where every Unicode codepoint fits in 1 codeunit. UTF-16 uses 16-bit codeunits, where codepoints <= U+FFFF fit in 1 codeunit, and higher codepoints fit in 2 codeunits (using surrogates). UTF-8 uses 8-bit codeunits, where each codepoint fits in 1-4 codeunits depending on its value... – Remy Lebeau Jul 01 '23 at 14:49
  • The `wchar_t` type is 2 bytes on Windows, and 4 bytes on most other platforms. So, `wchar_t` is used for UTF-16 on Windows, and for UTF-32 elsewhere. Modern compilers also have `char16_t` and `char32_t` for handling UTF-16/32. The `char` type is 1 byte on all platforms, by definition. Some modern compilers also have `char8_t` for handling UTF-8. – Remy Lebeau Jul 01 '23 at 14:50

1 Answer


Your second attempt is correct and does output UTF-8 as you wanted. The problem is that your terminal doesn't display UTF-8. See Displaying Unicode in PowerShell and Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10) for discussion of displaying UTF-8 in Windows terminals.

Your current configuration is one in which 0xc3 is rendered as ├, which suggests CP850, which I believe is the default for some of the mingw-based terminals (MSYS, git bash). It's been a very long time since I've used mingw, but you may also want to see How to set console encoding in MSYS?

Rob Napier
  • Does VSCode use Windows Powershell? – Alexander Jonsson Jul 01 '23 at 14:24
  • It depends on how you set it up. I don't know what the defaults are if you've configured it with mingw. I never use VSCode on Windows. If your issue is about vscode specifically, you may want to add the appropriate tag and adjust your question to get input from experts in that. My expertise is in C and UTF-8. – Rob Napier Jul 01 '23 at 14:31