
I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.

However, I have noticed that for every non-printable character I read, there is always an extra non-printable character in front of the one I really want to read.

For example, the character I want to read is "§". And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".

I looked it up in the debugger and saw "Â§" in the char array. But I don't have Â anywhere in my input file. [screenshot of debugger]

And after I write the non-printable character into my output file, I have noticed that it is also "Â§", not just "§".

There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?

Thanks!

Code as follows:

        case 1:
            mode = 1;
            FILE *fp;
            fp = fopen ("input2.txt", "r");
            int charCount = 0;

            while(!feof(fp)) {
                original_message[charCount] = fgetc(fp);
                charCount++;
            }
            original_message[charCount - 1] = '\0';
            fclose(fp);

            k = strlen(original_message);//split the original message into k input symbols
            printf("k: \n%lld\n", k);

            printf("ASCII code:\n");
            for (int i = 0; i < k; i++)
            {
                ASCII = original_message[i];
                printf("%d ", ASCII);
            }
Yooshinhee
  • `§` is not an ASCII character, it is Unicode. – Iłya Bursov Feb 10 '22 at 03:14
  • You're using `feof` incorrectly: https://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong – William Pursell Feb 10 '22 at 03:20
  • Looks like the file you're reading is UTF-8. The C language was created back in the ancient days of 8-bit character sets and is the worst choice for reading files that contain Unicode. – Lee Daniel Crocker Feb 10 '22 at 03:21
  • Use a hex editor to view the text file. I bet it's really two bytes on disk. Whatever program you used to generate the text file is using UTF-8 encoding, and in that encoding, the `§` character is two bytes long. You seem to be expecting ISO/IEC 8859-1 encoding (where `§` is a single byte). Either adapt your program to translate UTF-8 to ISO/IEC 8859-1 or convince the program that is generating the text file to use ISO/IEC 8859-1. Your main confusion is calling `§` "ASCII". It is not ASCII. ASCII stops at 127. – Raymond Chen Feb 10 '22 at 03:39
  • “Worst choice”? As opposed to what? Apple Pascal? C is no better or worse than any other language at handling Unicode. As it is, the biggest, most prolific library for handling Unicode is written in C: ICU4C. – Dúthomhas Feb 10 '22 at 03:39
  • `§` is not a "non-printing character"; it is a non-*ASCII* character. If your system were using a single-byte extended character set like [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1), you could read `§` as one byte (it'd be 167). But it looks like your system is using [Unicode](https://en.wikipedia.org/wiki/Unicode), in its multibyte rendition [UTF-8](https://en.wikipedia.org/wiki/UTF-8). – Steve Summit Feb 10 '22 at 04:48
  • Thanks for all the replies. Is there any way I can print out the Unicode value of §? For non-ASCII characters, how can I print something like 0x265e instead of the decimal 9822? Sorry if I sound dumb. – Yooshinhee Feb 10 '22 at 05:03
  • To print an integer value in hexadecimal, just use `%x` instead of `%d`. – Steve Summit Feb 10 '22 at 05:20
  • @LeeDanielCrocker: You'd be surprised how few programming languages *actually* understand Unicode even half as well as they claim they do. – DevSolar Feb 10 '22 at 15:59

1 Answer

C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
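
For instance, here is a minimal sketch (not the OP's program; the filename input2.txt is borrowed from the question) that reads a file byte by byte with fgetc and prints each byte's numeric value. Given a UTF-8 input containing "§", it prints 194 167, the two bytes 0xC2 0xA7 of that character's UTF-8 encoding. Note that it tests fgetc's return value against EOF instead of calling feof, as the comments above recommend.

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("input2.txt", "r");
    if (fp == NULL) {
        perror("input2.txt");
        return 1;
    }

    int c;                             /* int, not char, so it can hold EOF */
    while ((c = fgetc(fp)) != EOF)     /* idiomatic byte-reading loop */
        printf("%d ", c);              /* "§" in UTF-8 shows up as: 194 167 */
    printf("\n");

    fclose(fp);
    return 0;
}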

But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.

You will have to #include <wchar.h> to get the prototype for fgetwc (and <locale.h> for setlocale). And you may have to add the call

setlocale(LC_CTYPE, "");

at the top of your program to synchronize your program's character set "locale" with that of your operating system.

Not your original code, but I wrote this little program:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    wint_t c;                       /* wint_t, not wchar_t, so it can hold WEOF */
    setlocale(LC_CTYPE, "");
    while ((c = fgetwc(stdin)) != WEOF)
        printf("%lc %d\n", c, (int)c);
}

When I type "A", it prints A 65. When I type "§", it prints § 167. When I type "Ƶ", it prints Ƶ 437. When I type "†", it prints † 8224.

Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences (see the sketch after this list):

  1. Your original_message array is going to have to be an array of wchar_t, not an array of char.
  2. Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
  3. Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
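
Putting those three points together, a "wide" version of the reading loop from the question might look roughly like this (a sketch only; the array size, filename, and error handling are assumptions, not the OP's actual code):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wchar_t original_message[1000];        /* an array of wchar_t now; size assumed */
    setlocale(LC_CTYPE, "");

    FILE *fp = fopen("input2.txt", "r");
    if (fp == NULL)
        return 1;

    int charCount = 0;
    wint_t c;
    while ((c = fgetwc(fp)) != WEOF && charCount < 999)
        original_message[charCount++] = (wchar_t)c;
    original_message[charCount] = L'\0';   /* wide nul terminator */
    fclose(fp);

    size_t k = wcslen(original_message);   /* wcslen, not strlen */
    printf("k: %zu\n", k);

    for (size_t i = 0; i < k; i++)         /* %lc, not %c, for each wide character */
        printf("%lc %d\n", (wint_t)original_message[i], (int)original_message[i]);

    return 0;
}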

So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
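
And for the question's original goal, reading the characters from one file and writing them to another, the plain-byte, UTF-8-everywhere approach can be as small as this sketch (the output filename is an assumption; the bytes are copied without being interpreted at all):

#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("input2.txt", "r");   /* input filename from the question */
    FILE *out = fopen("output.txt", "w");   /* output filename is an assumption */
    if (in == NULL || out == NULL)
        return 1;

    int c;
    while ((c = fgetc(in)) != EOF)
        fputc(c, out);                      /* UTF-8 bytes pass through untouched */

    fclose(in);
    fclose(out);
    return 0;
}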

Steve Summit
  • Hello, thank you for the answer. I am just wondering, because I do need to split the string into its characters; how would I deal with them then? For example, if I need to read Korean characters from a txt file, write these Korean characters into a char array one by one, deal with each character separately somewhere else in my function, and finally write these Korean characters into another txt file one by one, what would be a fast and easy way? Thanks! – Yooshinhee Feb 10 '22 at 17:46
  • @Yooshinhee Another excellent function to know about is [`mbtowc`](https://linux.die.net/man/3/mbtowc), and the related [`mbstowcs`](https://linux.die.net/man/3/mbstowcs). These let you extract one or more "wide" Unicode characters from a multibyte string. You might also find some tips at [this question](https://stackoverflow.com/questions/4607413/), although it's talking about converting in the opposite direction. – Steve Summit Feb 10 '22 at 18:30
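
A minimal sketch of the mbstowcs route mentioned in the comment above, assuming the program runs under a UTF-8 locale (the sample string, including the Korean syllable, is only an illustration):

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");                /* use the system's (UTF-8) locale */

    const char *mb = "a§한";                /* multibyte (UTF-8) string, sample only */
    wchar_t wide[64];

    size_t n = mbstowcs(wide, mb, 64);      /* convert the whole string to wide chars */
    if (n == (size_t)-1)
        return 1;                           /* invalid multibyte sequence */

    for (size_t i = 0; i < n; i++)          /* each element is now one whole character */
        printf("%lc U+%04X\n", (wint_t)wide[i], (unsigned)wide[i]);

    return 0;
}

Converting back the other way before writing to an output file can be done with the corresponding wcstombs function.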