
I am trying to read and write unsigned char (0–255) extended ASCII characters (by which I mean the first 256 Unicode code points) to and from the console under Windows in C (cross-platform compatibility is needed).

Under extended ASCII (unicode), code-point 255 is ÿ and code-point 220 is Ü.

Right now I have the following code for writing and reading.

#include <stdio.h>
#include <wchar.h>   /* wprintf/wscanf are declared here */
#include <locale.h>

int main() {
    setlocale(LC_ALL, "");

    unsigned char ch = 255;
    wprintf(L"Character %d = %lc\n", ch, ch);

    wprintf(L"Enter a character: ");
    wscanf(L"%lc", &ch);
    wprintf(L"Character %d = %lc\n", ch, ch);

    return 0;
}

The output is:

Character 255 = ÿ
Enter a character: ÿ
Character 220 = Ü

As is evident, code point 255 is displayed properly as ÿ. However, when ÿ is given as input, it is read as code point 220. Consequently, when code point 220 is printed, it is displayed as Ü.

Thus, writing works fine. However, when reading characters above 127 (128–255), the code point read is 35 less than the actual value.

Can you please help me understand what I am doing wrong and how I can fix it?

Pratanu Mandal

1 Answer

%lc reads a wide character, wchar_t. "Wide" means it is larger than one byte; the exact size is implementation specific (2 bytes on Windows, 4 on most Unix-like systems). Pointing wscanf's %lc at a 1-byte unsigned char is undefined behavior: it stores a full wchar_t, writing one or more bytes past the end of the variable.

But if you're working with 1-byte characters you don't need wprintf or wscanf at all. Just use printf and scanf.

And, as noted by others, "extended ASCII" is not "Unicode". See this question for more.

Schwern
  • I have tried using printf and scanf. However, it produces the same result. Basically while reading, it reads as CP437 (which is the terminal encoding). But we need to use unicode instead. And what I mean by unicode is that I need to address first 256 characters of unicode. – Pratanu Mandal Oct 01 '20 at 07:17
  • @PratanuMandal Which Unicode encoding? UTF-8? UTF-16? Or do you mean the first 256 code points up to U+00FF? I don't mean to be obtuse, character encoding is complicated. – Schwern Oct 01 '20 at 09:31
  • @PratanuMandal Ok, what do you want to do with them? So far you're just reading a character and printing it back out. – Schwern Oct 01 '20 at 09:36
  • I read the character (between 0 to 255), and store it. Then there might be operations (addition or subtraction) on that value based on certain conditions. After the operations, the final result will be printed back to the user. – Pratanu Mandal Oct 01 '20 at 10:15
  • @PratanuMandal -- where does Unicode fit into this at all? If you read a CP437 character as one unsigned byte, and you want to write a CP437 character as one unsigned byte, why need you care about Unicode at all? – Kevin Boone Oct 01 '20 at 11:06
  • @KevinBoone To give a proper disclosure if it helps explain the problem better. I am writing an interpreter for BrainF*** language. Here I need to deal with 0 to 255 characters (of unicode to keep it standard across all platforms). When user enters a character into the terminal, I need it to follow unicode first 256 characters no matter what the code page is. – Pratanu Mandal Oct 01 '20 at 11:11
  • 3
    You'll need to convert a byte in your platform encoding into a Unicode code point. If your platform encoding is CP437, there's no guarantee that a single byte CP437 will be representable as one byte as a Unicode code point. That is, the code point might be a number > 255. Some CP437 characters do, in fact, map onto code points <= 255, but not all. Then when you write out, you'll need to do the conversion in reverse -- provided that the result of your calculation is actually representable as a CP437 character. This is a relatively complex process, with no clear good reason to implement ;) – Kevin Boone Oct 01 '20 at 11:19
  • @PratanuMandal there are 75 codepoints in your range that can't be converted to CP437, so what you're asking for is impossible. – Mark Ransom Oct 01 '20 at 14:05
  • @PratanuMandal As others have said, you can't convert CP437 to any Unicode format in 1 byte. But you don't have to. You're writing a BrainFuck interpreter. You decide what encoding your interpreter accepts. Since you want it to be standard it will probably be UTF-8. Then you treat the input as UTF-8. People writing for your interpreter write in UTF-8, and they will do that using an editor, not a DOS prompt. You *don't* try to guess their encoding and convert it. Also, Brainfuck only uses 8 ASCII characters so encoding is unlikely to be an issue and you can just treat the program as bytes. – Schwern Oct 01 '20 at 20:52
  • @Schwern Thanks for the help. I have decided to go with the same solution. It is futile to try to control the encoding. Instead it is much easier to just assume that user will provide UTF-8 data. Also, the issue occurs only on Windows; setting system("chcp 1252 > nul") fixes the issue by converting the terminal encoding to use 1252 code page. This works well enough for me. – Pratanu Mandal Oct 01 '20 at 21:18