How can I read UTF-8 characters from a file in C and process them one character at a time? This is my code:

FILE *file = fopen(fileName, "rb");
char *code;
size_t n = 0;
if (file == NULL) return NULL;
fseek(file, 0, SEEK_END);
long f_size = ftell(file);
fseek(file, 0, SEEK_SET);
code = malloc(f_size);
char a,b;
while (!feof(file)) {
    fscanf(file, "%c", &a);
    code[n++] = a;
    // I want to modify "a" (the current char) here
}
code[n] = '\0';

This is the file content:

~”م‘‎iاk·¶;R0ثp9´ -پ‘“گAéI‚sہئzOU,HدلKŒ©َض†ُ­ ت6‘گA=…¢¢³qد4â9àr}hw O‍Uجy.4a³‎M;£´`د$r(q¸Œçً£F 6pG|ںJr(TîsشR

  • The `char` type can hold numbers 0 to 255 or -128 to 127. To properly process Unicode text you need a type capable of holding characters 0 to 1114111. Surrogate pairs do save a little, allowing you to limit yourself to characters 0 to 65535. – Paul Stelian Jul 14 '16 at 17:33
  • You're reading one byte at a time, and UTF-8 is a variable-width encoding. You're going to be pretty limited in how you can modify the characters without trying to parse the input as UTF-8. I would try to find a library that lets you read and process UTF-8 text. – yellowantphil Jul 14 '16 at 17:35
  • See [while(!feof(fp)) is always wrong](http://stackoverflow.com/a/5432517/3386109), and [info on how to decode UTF-8](https://en.wikipedia.org/wiki/UTF-8#Description) – user3386109 Jul 14 '16 at 17:39
  • Thanks, my friend. What is the name of such a library for C? – Siavash Unesi Jul 14 '16 at 19:52
  • @SiavashUnesi: The file content you have shown in your question is *NOT* UTF-8. It is not even textual data. It looks more like binary data instead. – Remy Lebeau Jul 15 '16 at 05:24

1 Answer

A char is usually one byte, so it holds 256 distinct values (-128 to 127 or 0 to 255, depending on whether your compiler treats char as signed). That is enough for ASCII (0 to 127) and for the individual bytes of a UTF-8 sequence, but not for a whole character outside ASCII. For handling such characters I would recommend another type, such as wchar_t (if a wide character in your compiler is large enough to hold any Unicode code point), otherwise char32_t if you're using C11 or C++11, or a library such as ICU to deal with your data.
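
If you would rather not depend on wchar_t or a library, UTF-8 can also be decoded by hand. The following is only a minimal sketch, not part of the original code: utf8_decode is a hypothetical helper name, and it assumes the file's bytes are already in a buffer (for example one filled with fread). It turns one UTF-8 sequence into a single uint32_t code point, which you can then modify as one value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    size_t need;   /* total length of the sequence in bytes */
    uint32_t min;  /* smallest code point allowed for that length */
    unsigned char b0 = s[0];

    if (b0 < 0x80)                { *cp = b0; return 1; }  /* plain ASCII */
    else if ((b0 & 0xE0) == 0xC0) { need = 2; min = 0x80;    *cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; min = 0x800;   *cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; min = 0x10000; *cp = b0 & 0x07; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < need) return 0;  /* sequence is cut off */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF)) return 0;
    return need;
}
```

Call it in a loop, advancing through the buffer by the returned length each time. A return value of 0 means the bytes at that position are not valid UTF-8.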


Edit

This example code shows one way to deal with UTF-8 in C. Note that you have to make sure that wchar_t in your compiler can store every Unicode code point you need (it holds decoded characters, not UTF-8 bytes).

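#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#define MAX_CHARS 100000
int main(void) {
    static wchar_t sentence[MAX_CHARS+1]; /* static: too big for some stacks */
    wint_t ch; /* wint_t rather than wchar_t, so it can also hold WEOF */
    size_t n=0;
    char*loc = setlocale(LC_ALL, ""); /* must run before any wide I/O */
    /* Diagnostics go to stderr so that stdout stays a wide-oriented stream. */
    fprintf(stderr, "Locale set to: %s\n", loc ? loc : "(not set)");
    /* "ccs=UTF-8" is a Microsoft extension; elsewhere open with plain "r". */
    FILE *file=fopen("Testing.txt", "r, ccs=UTF-8");
    if(file==NULL){
        fprintf(stderr, "Error processing file\n");
        return EXIT_FAILURE;
    }
    while(n<MAX_CHARS && (ch = fgetwc(file)) != WEOF){
        /* WEOF marks end of file or a read error, whatever the size of wchar_t. */
        /* wprintf(L"%lc", ch); */
        sentence[n]=ch+1; /*Example modification*/
        n++;
    }
    sentence[n]=L'\0'; /* fputws and %ls need a terminated string */
    fclose(file);
    file=fopen("Testing.txt", "w, ccs=UTF-8");
    if(file==NULL){
        fprintf(stderr, "Error processing file\n");
        return EXIT_FAILURE;
    }
    fputws(sentence, file);
    wprintf(L"%ls", sentence);
    fclose(file);
    return 0;
}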
  • Your system locale
    The line char*loc = setlocale(LC_ALL, ""); selects your system's locale and lets you see which one it is. Make sure it is a UTF-8 locale if you're using Linux. If you're using Windows, you'll have to stick to one language when printing. This is not a problem if you don't want to print the characters.
  • How to open the file
    First, I opened it for reading as a text file instead of as a binary file. I also have to open the file with UTF-8 encoding (I think on Linux it will follow your locale, so the ccs=UTF-8 won't be necessary). Even though on Windows we're stuck with one language, the file still has to be read as UTF-8.
  • Using functions compatible with the characters
    For this we'll use the functions declared in wchar.h (like wprintf and fgetwc). The problem with the other functions is that they work in units of char, so they give you single bytes rather than whole characters.
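
How much the wchar.h functions can represent depends on how wide wchar_t is on your compiler. As a rough sketch of how to check that, something like the following could be used. The two function names are made up for this example; WCHAR_MAX and __STDC_ISO_10646__ are standard C.

```c
#include <wchar.h>  /* wchar_t, WCHAR_MAX */

/* Hypothetical helper: returns 1 if a single wchar_t can hold every Unicode
 * code point (up to U+10FFFF), 0 if it cannot, as with a 16-bit wchar_t. */
int wchar_covers_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Hypothetical helper: returns 1 if the implementation promises that wchar_t
 * values are ISO 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

As far as I know, wchar_t is 16 bits wide in Visual Studio, so there the first function returns 0 and characters above U+FFFF take two wchar_t units instead of one.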

I used this as an example:

¿khñà?
hello
~”م‘‎iاk·¶;R0ثp9´ -پ‘“گAéI‚sہئzOU,HدلKŒ©َض†ُ­ ت6‘گA=…¢¢³qد4â9àr}hw O‍Uجy.4a³‎M;£´`د$r(q¸Œçً£F 6pG|ںJr(TîsشR

In the last part of the program, it overwrites the file with the accumulated modified string.
You could try changing sentence[n]=ch+1; to sentence[n]=ch; to check that it reads and writes your original file correctly (and uncomment the wprintf to check the output).
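
If you keep the characters as plain code points rather than wchar_t, the write-back step means turning each modified value into UTF-8 bytes again. Here is a small sketch of that, under the assumption that code points are stored as uint32_t; utf8_encode is a hypothetical helper, not a standard function. It also shows why a modification like ch+1 needs care: the result may no longer be a valid character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out (4 bytes of room).
 * Returns the number of bytes written, or 0 if cp is not a valid Unicode
 * scalar value (a UTF-16 surrogate, or anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x10000) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;  /* beyond the Unicode range */
}
```

The bytes can then be written with fwrite(out, 1, len, file) to a file opened in "wb" mode. When the function returns 0, nothing should be written for that value.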

  • Thanks, my friend. Could you rewrite the code above to use **wchar_t**? I changed it to **wchar_t**, but the result was the same as with **char**. – Siavash Unesi Jul 14 '16 at 19:46
  • Suggest `char32_t` instead of `wchar_t`. Otherwise, ensure `__STDC_ISO_10646__` is defined. – chux - Reinstate Monica Jul 14 '16 at 20:05
  • "Chars can only hold from 0 to 255 (1 byte), or in other words, just the ASCII table. " is incorrect on 3 points. 1) The range of `char` is often -128 to 127, it is implementation defined and matches either `signed char` or `unsigned char`. 2) The range on rare machines is more than 8 bits. 3) ASCII is only defined for codes 0 - 127. – chux - Reinstate Monica Jul 14 '16 at 20:07
  • Thanks for the feedback, and sorry for a novice answer. – sicalmforgost Jul 15 '16 at 03:44
  • The `ccs=UTF-8` flag is a Microsoft-specific extension of `fopen()` in Visual Studio. Other vendor implementations of `fopen()` do not have that extension. – Remy Lebeau Jul 15 '16 at 05:24
  • Thanks very much, my friend "sicalmforgost"; it works <3 – Siavash Unesi Jul 15 '16 at 11:39