I am trying to read and manipulate Urdu text from files. However, it seems that a character is not read whole into the wchar_t variable. Here is my code, which reads text and prints each character on a new line:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
    return 0;
}

And here is my sample text:

میرا نام ابراھیم ھے۔

میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔

However, there seem to be twice as many characters printed as there are letters in the text. I understand that wide or multi-byte characters use multiple bytes, but I thought that the wchar_t type would store all the bytes corresponding to a letter in the alphabet together.

How can I read the text so that at any one time, I have a whole character stored in a variable?

Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64 bit
Text file encoding: UTF-8

This is how my text looks in hex format:

d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 ad 98 5d b8 cd ab a2 0d 98 8d b8 cd 98 6d a8 8d 8b 1d 8a 8d 98 4d 9b 92 0d b8 cd 98 8d 98 6d b8 cd 98 8d 8b 1d 8b 3d 9b 9d b8 c2 0d 98 5d b8 cd ab a2 0d 9b ed a9 1d ab ed 8a ad 8a 72 0d ab ed 98 8d ab ad b9 4a
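
For reference, a dump like the one above can be produced with a short sketch that prints a file's raw bytes in hex (assuming the same urdu.txt file; binary mode avoids any text-mode translation):

#include <stdio.h>

int main(void) {
    FILE *f = fopen("urdu.txt", "rb");  // "rb": read raw bytes, no translation
    if (f == NULL) {
        return 1;
    }
    int b;
    while ((b = fgetc(f)) != EOF) {
        printf("%02x ", b);  // fgetc returns an unsigned char value in an int
    }
    fclose(f);
    return 0;
}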
hazrmard
  • you'll need to give more details, such as the encoding of the text file, what compiler and OS you are using – M.M Jul 17 '16 at 04:37
  • Thanks, added details at the end. gcc 5.3.0, Win 10, UTF-8 – hazrmard Jul 17 '16 at 04:40
  • Possible duplicate of [Light C Unicode Library](http://stackoverflow.com/questions/313555/light-c-unicode-library) – Jul 17 '16 at 04:44
  • [Not all types are created equal](http://stackoverflow.com/questions/17871880/should-i-use-wchar-t-when-using-utf-8) – txtechhelp Jul 17 '16 at 05:05
  • Testing on Mac OS X 10.11.5 with GCC 6.1.0, and using a cut'n'paste of your text into a file, I get sane behaviour. The file contains a series of UTF-8 characters in the range U+0627 .. U+06D4 (plus a few regular characters — space U+0020 and newline U+000A), and no byte-order mark since they're irrelevant on Unix and in UTF-8. What exactly is the byte content of your data file? – Jonathan Leffler Jul 17 '16 at 05:17
  • @Amd that might be useful info but certainly not a duplicate – M.M Jul 17 '16 at 05:18
  • for starters, in Windows, `wchar_t` is 16-bit, so (if everything is working properly - which may require further configuration) it would get UTF-16 characters. Further, I assume you're outputting to the Windows console, which doesn't know about UTF-16 (and has to be cajoled even into handling UTF-8). It's basically impossible to output properly to Windows console, you'll have to use a third party console, or output to a GUI. – M.M Jul 17 '16 at 05:22
  • To help find out what's going on you could output the character codes of each character (a minimal sketch of this follows these comments) – M.M Jul 17 '16 at 05:23
  • I added hex codes of my source text file. – hazrmard Jul 17 '16 at 05:33
  • The first 20 characters of your hex data are valid as UTF-8 (and map to Urdu characters). After the 0xDB 0x92 0xDB 0x94 sequence, the encoding ceases to be valid UTF-8. However, treated as UTF-16, the values seem to be in the surrogates range (0xD800..0xDBFF for high surrogates and 0xDC00..0xDFFF for low surrogates). That makes the data puzzling. – Jonathan Leffler Jul 17 '16 at 06:10
  • @JonathanLeffler Looks like a transcription error. Instead of `ad 98 5d b8 cd ab a2` etc there should be `0a d9 85 db 8c da ba` etc. – n. m. could be an AI Jul 17 '16 at 06:29
  • @M.M Windows console is perfectly capable of both UTF-8 and UTF-16. [Here's a screenshot of a slightly fixed OP's program, built with cygwin gcc](http://imgur.com/2SCpuGJ). No special hoops jumped. There are no Urdu characters because there are no fonts, but other scripts are OK. – n. m. could be an AI Jul 17 '16 at 06:47
  • IIRC `wchar_t` on Windows is 16-bits and uses UTF-16 encoding. – davmac Jul 17 '16 at 07:42
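
As M.M suggests above, a quick way to see what the stream actually delivers is to print the numeric value of each result of fgetwc. Here is a minimal sketch, assuming the same urdu.txt file as in the question:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "");
    FILE *f = fopen("urdu.txt", "r");
    if (f == NULL) {
        return 1;
    }
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        printf("U+%04X\n", (unsigned)c);  // numeric value of each unit read
    }
    fclose(f);
    return 0;
}

On a setup where the C library does not decode UTF-8, each byte of a multi-byte sequence comes back as a separate value, which matches the doubled output described in the question.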

2 Answers


Windows support for Unicode is mostly proprietary, and it is impossible to write portable software that uses UTF-8 and works on Windows using only the native libraries. If you are willing to consider non-portable solutions, here is one:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>
#include <io.h>     // _setmode, _fileno

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");

    // Next line is needed to output wchar_t data to the console. Note that 
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening with a text editor
    // should show Urdu characters.

    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea. Using wprintf throughout. (Not Windows-specific)

    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. It is necessary
    // to use wint_t to keep a result of fgetwc, or to print with
    // %lc. (Not Windows-specific)

    wint_t c;

    // Next line has a non-standard parameter passed to fopen, ccs=...
    // This is a Windows way to support different file encodings.
    // There are no UTF-8 locales in Windows. 

    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");
    if (f == NULL) {
        return 1;  // file missing or unreadable
    }

    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
    return 0;
}

OTOH with a POSIX-style C library (e.g. glibc on Linux, or Cygwin's newlib-based libc) these Windows extensions are not needed, because the C library handles these things internally.
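
For comparison, here is a minimal sketch of that portable variant; it assumes a C library with UTF-8 locale support (e.g. under Cygwin or on Linux) and uses none of the Windows extensions:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "");  // picks up a UTF-8 locale where the system provides one
    FILE *f = fopen("urdu.txt", "r");
    if (f == NULL) {
        return 1;
    }
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
    return 0;
}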

n. m. could be an AI

UTF-8 is an encoding for Unicode that takes 1 to 4 bytes per character. I was able to store each Unicode character in a uint32_t (or u_int32_t on some UNIX platforms) variable. The library I used is (utf8.h | utf8.c). It provides some conversion and manipulation functions for UTF-8 strings.

So if a file is n bytes of UTF-8, it will contain at most n Unicode characters, which means 4*n bytes of memory (4 bytes per u_int32_t variable) are enough to store the contents of the file.

#include "utf8.h"

// here read contents of file into a char* => buff
// keep count of # of bytes read => N

u_int32_t *ubuff = calloc(N, sizeof(u_int32_t));  // calloc initializes to 0
u8_toucs(ubuff, N, buff, N);

// ubuff now is an array of 4-byte integers representing
// a Unicode character each

Of course, it is entirely possible that there will be fewer than n Unicode characters in the file, since multiple bytes can represent a single character. This means that the 4*n allocation is too large, and the tail of ubuff is left as 0 (the Unicode null character) by calloc. So I simply scan the array and reallocate memory as needed:

u_int32_t *original = ubuff;
int sz = 0;
while (*ubuff != 0) {
    ubuff++;
    sz++;
}
ubuff = realloc(original, sizeof(*original) * (sz + 1));  // +1 keeps the terminating 0

Note: If you get type errors about `u_int32_t`, put `#include <stdint.h>` and `typedef uint32_t u_int32_t;` at the beginning of your code.
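
Putting the pieces together, here is a self-contained sketch. It is only a sketch: it assumes the utf8.h / utf8.c library linked above (with the u8_toucs signature used in the snippets), the same urdu.txt sample file, and the typedef from the note above if your platform lacks u_int32_t:

#include <stdio.h>
#include <stdlib.h>
#include "utf8.h"  // the library linked above; provides u8_toucs

int main(void) {
    FILE *f = fopen("urdu.txt", "rb");
    if (f == NULL) {
        return 1;
    }

    // read the whole file into buff, counting bytes in N
    fseek(f, 0, SEEK_END);
    long N = ftell(f);
    rewind(f);
    char *buff = malloc(N);
    if (buff == NULL || fread(buff, 1, N, f) != (size_t)N) {
        return 1;
    }
    fclose(f);

    // worst case: one code point per byte; +1 slot keeps a terminating 0
    u_int32_t *ubuff = calloc(N + 1, sizeof(u_int32_t));
    u8_toucs(ubuff, (int)N, buff, (int)N);

    // count the converted code points, then shrink the buffer
    int sz = 0;
    while (ubuff[sz] != 0) {
        sz++;
    }
    ubuff = realloc(ubuff, (sz + 1) * sizeof(u_int32_t));

    printf("%d code points read\n", sz);
    free(buff);
    free(ubuff);
    return 0;
}

If your version of u8_toucs returns the number of characters it converted, that return value can replace the scan for the terminating 0.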

hazrmard