
I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not single characters but multi-character signs. (By this I mean they can't be written as a character constant '∞' but only as a string "∞", so char * ?)

Here is my textfile :

Text : |(abc∞∪v=|)

For example:

∞ should be changed to ¤c

∪ to ¸!

= to "

Since some signs (∞ and ∪) are multi-character, I decided to use fscanf to read the text word by word. The problem with this method is that I have to put a space between each character... My file would have to look like this:

Text : |( a b c ∞ ∪ v = |)

fgetc can't be used because characters like ∞ can't be read as one single character. If I use it, I won't be able to strcmp a char with each sign (char *); I tried to convert my char to char *, but strcmp returns a non-zero value.

Here is my code in C to help you understand my problem:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void){
    char *carac[]={"∞","=","∪"}; //array with our signs
    FILE *flot,*flot3;
    flot=fopen("fichierdeTest2.txt","r"); // input text file
    flot3=fopen("resultat.txt","w"); //output file
    int i=0,j=0;
    char a[1024]; //array that will contain each read word.
    while(!feof(flot))
    {
        fscanf(flot,"%s",&a[i]);
        if (strstr(&a[i], "|(") != NULL){ // if the word read contains |(  then j=1
            j=1;
            fprintf(flot3,"|(");
        }
        if (strcmp(&a[i], "|)") == 0)
            j=0;
        if(j==1) { //it means we are between |( and |) so the conversion can begin
            if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
            else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
            else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
            else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
        }
        else { // when we are not between |( and |) just copy the word to the output file with a space after it
            fprintf(flot3, "%s", &a[i]);
            fprintf(flot3, " ");
        }
        i++;
    }
}

Thanks a lot in advance for your help!

EDIT: Every sign is changed correctly if I put a space between each of them, but without the spaces it doesn't work; that's what I'm trying to solve.

BinX
  • What about `fgetwc()`? – ad absurdum Dec 21 '16 at 15:21
  • Good question format, but some minor things: avoid `feof` (http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong), make `j` something different like `is_converting` or something because `j` is usually an iterator. – Dellowar Dec 21 '16 at 15:23
  • Look at using `fread()` rather than `fscanf()`. Since you are using UTF-8 with multi-byte characters you will need to have a mechanism to read the byte stream in and then process the characters one at a time and recognizing multi-byte characters in the UTF-8 stream. See also [UTF8 processing C](http://stackoverflow.com/questions/10948234/utf8-processing-in-c) and see also this blog posting [Using UTF-8 as the internal representation for strings in C and C++ with Visual Studio](http://www.nubaria.com/en/blog/?p=289). – Richard Chambers Dec 21 '16 at 16:59
  • [C: Using scanf and wchar_t to read and print UTF-8 strings](https://linuxprograms.wordpress.com/2012/05/12/using-scanf-and-wchar_t-to-read-and-print-utf-8-strings/) has a short demo program that demonstrates use of `setlocale(LC_ALL, "");` along with the `%ls` format specifier as in `scanf("%ls",string);`. – Richard Chambers Dec 21 '16 at 17:14

2 Answers


First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.

In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may consist of several bytes (that is, several chars). Such characters are called multi-byte ones.

Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit codepage or any other encoding.

When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
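To see which bytes a string really contains, you can print them in hex. The following is only a minimal sketch; the helper name `dump_bytes` is made up for illustration, and it assumes the string is written with explicit UTF-8 byte escapes so that the encoding of the source file does not matter.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as uppercase hex
   pairs separated by spaces, e.g. "E2 88 9E". The caller must provide
   room for at least 3 * strlen(s) + 1 bytes in out. */
static void dump_bytes(const char *s, char *out)
{
    size_t n = strlen(s);
    out[0] = '\0';
    for (size_t k = 0; k < n; k++) {
        /* Cast to unsigned char so bytes above 0x7F do not sign-extend. */
        sprintf(out + 3 * k, k + 1 < n ? "%02X " : "%02X",
                (unsigned char)s[k]);
    }
}
```

Calling `dump_bytes("\xe2\x88\x9e", buf)` fills `buf` with the three bytes that UTF-8 uses for ∞. Running the same helper on a word read from the input file shows whether the file and the source code agree on the encoding.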


So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:

char *carac[]={
    "\xe2\x88\x9e", // ∞
    "=",
    "\xe2\x88\xaa"}; // ∪

Alternatively, make sure the encodings (of your source code and your program's input file) are the same.


Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:

if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
    fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
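The difference between the two functions can be shown with a small prefix test. This is a sketch only; `starts_with` is a hypothetical helper name, and the signs are written as UTF-8 byte escapes on the assumption that the input is UTF-8.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if text begins with the bytes of sign,
   0 otherwise. Only strlen(sign) bytes are compared, so whatever follows
   the sign in text is ignored. */
static int starts_with(const char *text, const char *sign)
{
    return strncmp(text, sign, strlen(sign)) == 0;
}
```

With the text "∞∪v=|)", `starts_with(text, "\xe2\x88\x9e")` reports a match, whereas `strcmp` on the same two strings returns a non-zero value because the text continues after the sign.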

Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:

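fscanf(flot, "%s", a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    // strncmp returns 0 on a match, so compare its result with 0
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...); // write the replacement for this sign
        i += strlen(whatever); // skip all bytes of the matched sign
    }
    else
    {
        fputc(a[i], output); // no match: copy the byte unchanged
        i += 1; // processed just one char
    }
}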
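Putting the pieces together, one possible complete version of that loop is sketched below. It is not the only way to do it: the names `convert`, `signs` and `replacements` are made up for illustration, the sign-to-replacement mapping is taken from the question (without the leading space that the question's code prints before ¸!), and the function works on a string already held in memory, so how the text is read from the file is left open.

```c
#include <stdio.h>
#include <string.h>

/* Signs and their replacements as UTF-8 byte sequences (assumed encoding). */
static const char *const signs[] =
    { "\xe2\x88\x9e", "=", "\xe2\x88\xaa" };        /* ∞  =  ∪  */
static const char *const replacements[] =
    { "\xc2\xa4" "c", "\"", "\xc2\xb8!" };          /* ¤c "  ¸! */

/* Hypothetical helper: copies in to out, replacing the signs above only
   between the markers "|(" and "|)". The markers themselves are copied. */
static void convert(const char *in, FILE *out)
{
    int now_replacing = 0;
    size_t i = 0;
    while (in[i] != '\0') {
        if (strncmp(&in[i], "|(", 2) == 0) {        /* start pattern */
            now_replacing = 1;
            fputs("|(", out);
            i += 2;
            continue;
        }
        if (strncmp(&in[i], "|)", 2) == 0) {        /* end pattern */
            now_replacing = 0;
            fputs("|)", out);
            i += 2;
            continue;
        }
        size_t matched = 0;                         /* bytes consumed */
        if (now_replacing) {
            for (size_t k = 0; k < sizeof signs / sizeof signs[0]; k++) {
                size_t len = strlen(signs[k]);
                if (strncmp(&in[i], signs[k], len) == 0) {
                    fputs(replacements[k], out);
                    matched = len;
                    break;
                }
            }
        }
        if (matched == 0) {                         /* copy byte unchanged */
            fputc(in[i], out);
            matched = 1;
        }
        i += matched;
    }
}
```

For the input line from the question, "Text : |(abc∞∪v=|)", this writes "Text : |(abc¤c¸!v"|)". An = outside the markers is left alone.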
anatolyg
  • That really helped to solve my problem ! I've just changed a few things in your code – BinX Dec 21 '16 at 22:06

You're on the right track, but you need to look at characters differently than strings.

strcmp(carac[0], &a[i])

(Pretending i = 2) As you know, this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the character at index 2 of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up being compared with "abc∞∪v=|)", because a is only null-terminated at the very end.

What you should do is not use strings, but expand each character (8 bits) to a short (16 bits). Then you can compare them with your UTF-16 characters:

if (8734 == *((short *)&a[i])) { /* character is infinity */ }

The reason for 8734 is that it is the UTF-16 value of infinity.

VERY IMPORTANT NOTE: Whether your machine is big-endian or little-endian matters in this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
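Note that this comparison only applies when the file really is UTF-16. In a UTF-8 file, ∞ is stored as the three bytes E2 88 9E, so no two adjacent bytes hold 0x221E directly. As a rough sketch of how the same value can be recovered from UTF-8 (the function name `utf8_decode` is made up for illustration, and the decoder deliberately skips checks for overlong forms and surrogates):

```c
#include <stddef.h>

/* Hypothetical helper: decodes one UTF-8 sequence starting at s.
   Stores the code point in *cp and returns the number of bytes used,
   or 0 if the bytes do not form a 1- to 4-byte sequence.
   The caller should check for the end of the string first, because a
   terminating '\0' decodes as code point 0 with length 1. */
static size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;

    if (p[0] < 0x80)                { *cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid byte */

    for (size_t k = 1; k < len; k++) {
        /* Continuation bytes look like 10xxxxxx; a '\0' fails this test,
           so the loop never reads past the end of the string. */
        if ((p[k] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[k] & 0x3F);
    }
    return len;
}
```

Decoding the bytes E2 88 9E this way yields 8734 (0x221E) and a length of 3, which is also how far the index has to advance to reach the next character.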

Edit: Something else I overlooked is that you're scanning the entire string at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)." (source)

//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.

That means you never actually iterate. You need to go back and rethink your scanning procedure.
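One common way to restructure the scanning is to let the return value of fscanf control the loop instead of feof. The sketch below only counts the words it reads; `count_words` is a hypothetical name, and the 1024-byte buffer mirrors the size used in the question.

```c
#include <stdio.h>

/* Hypothetical helper: counts the whitespace-delimited words in f.
   fscanf returns the number of items it stored, so the loop ends as soon
   as no further word can be read, with no extra pass at end of file. */
static int count_words(FILE *f)
{
    char word[1024];
    int count = 0;
    /* %1023s limits the read so the terminating '\0' always fits. */
    while (fscanf(f, "%1023s", word) == 1) {
        count++;
    }
    return count;
}
```

In the real program, the body of the loop would process `word` (for example byte by byte) rather than just incrementing a counter.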

Dellowar
  • that's assuming UTF16 is the encoding being used on the text files :) OP doesn't specify that. – Ahmed Masud Dec 21 '16 at 15:56
  • 1
    @AhmedMasud good point. While I was writing this question I learned that it is impossible to get the encoding of a text file without guessing. My knowledge is limited, but OP may be in a real pickle without the use of some guessing libraries. – Dellowar Dec 21 '16 at 16:00
  • My text files are written in UTF-8 ! – BinX Dec 21 '16 at 16:18