Reading a text file with non-English characters in C

Question

Is it possible to read a text file that has non-English text?

Example of text in file:

E 37

SVAR:

Fettembolisyndrom. (1 poäng)

Example of what is in the buffer that stores the "fread" output, printed using "puts":

E 37 SVAR:

Fettembolisyndrom. (1 po├ñng)

Under Linux my program was working fine, but in Windows I am seeing this problem with non-English letters. Any advice on how this can be fixed?

Program:

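#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int debug = 0;

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        puts("ERROR! Please enter a filename");
        exit(1);
    }
    else if (argc > 2)
    {
        debug = atoi(argv[2]);
        puts("Debugging mode ENABLED!");
    }

    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL)
    {
        perror(argv[1]);
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    long fileSz = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (fileSz < 0)
    {
        perror("ftell");
        exit(1);
    }

    /* One extra byte so the buffer can be NUL-terminated for strstr(). */
    char* buffer = malloc((size_t)fileSz + 1);
    if (buffer == NULL)
    {
        perror("malloc");
        exit(1);
    }
    size_t readSz = fread(buffer, 1, (size_t)fileSz, fp);
    buffer[readSz] = '\0';
    rewind(fp);

    if (readSz == (size_t)fileSz)
    {
        char tmpBuff[100];

        /* fgets() returns NULL on error or end of file. */
        if (fgets(tmpBuff, sizeof tmpBuff, fp) != NULL)
        {
            printf("100 characters from text file: %s\n", tmpBuff);
        }
        else
        {
            printf("Error encountered\n");
        }
    }

    /* strstr(haystack, needle): search the file contents for the word.
       The match is byte-wise, so this source file must be saved in the
       same encoding as the input file (UTF-8 here). */
    if (strstr(buffer, "FRÅGA") == NULL)
    {
        printf("String not found!\n");
    }

    free(buffer);
    fclose(fp);
    return 0;
}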

[Screenshot: sample output]

[Screenshot: text file]

Make sure your file is encoded as UTF-8 and your console expects UTF-8. Apparently that’s done on Windows like this: https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how – Ry-, Jan 18 '17 at 22:21
There is probably nothing wrong with your program. What you see in the DOS window is related to the code page setting. You will need to set code page 851. – Laszlo, Jan 18 '17 at 22:23
(Do not use wide characters [wchar]. It doesn’t save you from having to deal with encodings and will make things worse.) – Ry-, Jan 18 '17 at 22:30
I have added a code sample and screenshots of the program output and the text file used. I am trying to search for the string "FRÅGA", which as you can see is present in the file, but it is not found. I checked in Notepad++ and the file encoding is UTF-8. – mohxinn, Jan 19 '17 at 12:46
@mohxinn: What character encoding is used by the input file? What character encoding is expected by the console? Are they the same? If not, you must transcode the characters before displaying them. – AlexP, Jan 19 '17 at 15:18

AlexP · Answer 1 · 2017-01-19T15:27:17.267

1

Summary: If you read text from a file encoded in UTF-8 and display it on the console you must either set the console to UTF-8 or transcode the text from UTF-8 to the encoding used by the console (in English-speaking countries, usually MS-DOS code page 437 or 850).
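
For the first option, one possibility on Windows is to switch the console output code page from inside the program. The following is a minimal sketch, assuming a Windows build where SetConsoleOutputCP from <windows.h> is available; the helper name enable_utf8_console is made up for illustration, and whether the letters actually render also depends on the console font.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: ask the Windows console to interpret program output
   as UTF-8. Returns 1 on success and 0 on failure. On other systems it does
   nothing and reports success, assuming the terminal already uses UTF-8. */
int enable_utf8_console(void)
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;   /* CP_UTF8 is code page 65001 */
#else
    return 1;
#endif
}
```

Calling enable_utf8_console() at the start of main, before anything is printed, should have roughly the same effect as running chcp 65001 in the console beforehand.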

Longer explanation

Bytes are not characters and characters are not bytes. The char data type in C holds a byte, not a character. In particular, the character Å (Unicode <U+00C5>) mentioned in the comments can be represented in many ways, called encodings:

In UTF-8 it is two bytes, '\xC3' '\x85';
In UTF-16 it is two bytes, either '\xC5' '\x00' (little-endian UTF-16), or '\x00' '\xC5' (big-endian UTF-16);
In Latin-1 and Windows-1252, it is one byte, '\xC5';
In MS-DOS code page 437 and code page 850, it is one byte, '\x8F'.
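
The UTF-8 bytes listed above can be checked with a few lines of C. This is a minimal sketch of the standard UTF-8 encoding rules; the function name utf8_encode is hypothetical and not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the number of bytes written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00C5 (Å) this yields the two bytes 0xC3 0x85, and for U+00E4 (ä) it yields 0xC3 0xA4. A console set to code page 850 shows those last two bytes as "├ñ", which is the garbling seen in the question.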

It is the responsibility of the programmer to translate between the internal encoding used by the program (usually but not always Unicode), the encoding used in input or output files, and the encoding expected by the display device.
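
To illustrate what such a translation involves, here is a deliberately small sketch that transcodes UTF-8 text to code page 850. It assumes the input contains only ASCII and the six Swedish letters å, ä, ö, Å, Ä, Ö; everything else becomes '?'. The function names are hypothetical. A real program on Windows would more likely go through the system conversion functions (MultiByteToWideChar and WideCharToMultiByte) or a library such as iconv than maintain its own table.

```c
#include <stddef.h>

/* Map a code point to its CP850 byte. Only an illustrative subset is
   covered; CP437 uses the same byte values for these six letters. */
static unsigned char cp850_from_codepoint(unsigned long cp)
{
    if (cp < 0x80)
        return (unsigned char)cp;          /* ASCII is unchanged */
    switch (cp) {
    case 0xE5: return 0x86;                /* å */
    case 0xE4: return 0x84;                /* ä */
    case 0xF6: return 0x94;                /* ö */
    case 0xC5: return 0x8F;                /* Å */
    case 0xC4: return 0x8E;                /* Ä */
    case 0xD6: return 0x99;                /* Ö */
    default:   return '?';                 /* not in this small table */
    }
}

/* Hypothetical helper: transcode NUL-terminated UTF-8 text into CP850.
   Decodes 1- and 2-byte sequences; longer or malformed sequences become '?'.
   The output is NUL-terminated whenever outsz > 0.
   Returns the number of bytes written, not counting the terminator. */
size_t utf8_to_cp850(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsz == 0)
        return 0;
    while (*p != 0 && n + 1 < outsz) {
        unsigned long cp;
        if (p[0] < 0x80) {
            cp = p[0];
            p += 1;
        } else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            cp = ((unsigned long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            if (cp < 0x80)
                cp = 0xFFFD;               /* reject overlong forms */
            p += 2;
        } else {
            cp = 0xFFFD;                   /* unsupported or malformed */
            p += 1;
            while ((*p & 0xC0) == 0x80)    /* skip continuation bytes */
                p++;
        }
        out[n++] = (char)cp850_from_codepoint(cp);
    }
    out[n] = '\0';
    return n;
}
```

With this sketch, the UTF-8 bytes of "poäng" (p, o, 0xC3, 0xA4, n, g) come out as p, o, 0x84, n, g, which a console using code page 850 displays as "poäng".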

Note: Sometimes, if the program does not do much with the characters it reads and outputs, one can get by just by making sure that the input files, the output files, and the display device all use the same encoding. In Linux, this encoding is almost always UTF-8. Unfortunately, on Windows the existence of multiple encodings is a fact of life. System calls expect either UTF-16 (the "W" functions) or the system ANSI code page, which is Windows-1252 in Western locales (the "A" functions). By default, the console uses the OEM code page, typically 437 or 850. Text files are quite often in UTF-8. Windows is old and complicated.

edited Jan 19 '17 at 15:27

answered Jan 19 '17 at 15:12

AlexP


Nice explanation but it doesn't answer the question. – Carey Gregory Jan 19 '17 at 15:15
@CareyGregory: The question cannot be answered unless we know what encoding is used in the input file and what encoding is used by the console. That's the point of the explanation. – AlexP Jan 19 '17 at 15:17
He said it's UTF-8 in the comments. – Carey Gregory Jan 19 '17 at 15:19
@CareyGregory: The input file is UTF-8, but the console isn't. The console looks like CP437 or CP850. I have made a comment asking for clarification. – AlexP Jan 19 '17 at 15:20
As shown in the screenshot from Notepad++, the encoding of the file is UTF-8. @AlexP How do I check the encoding expected by the console? Is it possible to change the encoding via the CodeBlocks IDE that I am using? I am using Windows 10, if that is of any consequence :) – mohxinn Jan 19 '17 at 19:11
OK, I managed to figure out which code page is used, and it seems to be 850: C:\Users\mohsi>chcp Active code page: 850. What should the code page be set to for UTF-8? Is it 65001, as mentioned here: http://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8 – mohxinn Jan 19 '17 at 19:16
