27

My setup: gcc-4.9.2, UTF-8 environment.

The following C program works with ASCII, but does not with UTF-8.

Create input file:

echo -n 'привет мир' > /tmp/вход

This is test.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 10

int main(void)
{
  char buf[SIZE+1];
  char *pat = "привет мир";
  char str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  if (fread(buf, 1, SIZE, f1) > 0) {
    buf[SIZE] = 0;

    if (strncmp(buf, pat, SIZE) == 0) {
      sprintf(str, "% 11s\n", buf);
      fwrite(str, 1, SIZE+2, f2);
    }
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}

Check the result:

./test; grep -q ' привет мир' /tmp/выход && echo OK

What should be done to make the UTF-8 code work as if it were ASCII, i.e., without having to bother about how many bytes a symbol takes? In other words: what should be changed in the example so that any UTF-8 symbol is treated as a single unit (this includes argv, stdin, stdout, stderr, file input and output, and the program code itself)?

Igor Liferenko

5 Answers

17
#define SIZE 10

The buffer size of 10 is insufficient to store the UTF-8 string привет мир. Try changing it to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.

UTF-8 is a multibyte encoding which uses between 1 and 4 bytes per character, so it is safer to use 40 as the buffer size here. There is a big discussion at "How many bytes does one Unicode character take?" which might be interesting.
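
To make the byte count concrete, here is a minimal sketch (illustrative, not part of the original answer) that counts both bytes and characters of the test string; in UTF-8 each Cyrillic letter takes 2 bytes, so "привет мир" is 10 characters but 19 bytes:

#include <stdio.h>
#include <string.h>

int main(void)
{
  const char *pat = "привет мир";
  size_t bytes = strlen(pat);                    /* 19 bytes in UTF-8 */
  size_t chars = 0;

  for (size_t i = 0; i < bytes; i++)
    if (((unsigned char)pat[i] & 0xC0) != 0x80)  /* skip UTF-8 continuation bytes */
      chars++;

  printf("bytes: %zu, characters: %zu\n", bytes, chars);
  return 0;
}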

Siddhartha Ghosh
10

Siddhartha Ghosh's answer gives you the basic problem. Fixing your code requires more work, though.

I used the following script (chk-utf8-test.sh):

echo -n 'привет мир' > вход
make utf8-test
./utf8-test
grep -q 'привет мир' выход && echo OK

I called your program utf8-test.c and amended the source like this, removing the references to /tmp, and being more careful with lengths:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 40

int main(void)
{
    char buf[SIZE + 1];
    char *pat = "привет мир";
    char str[SIZE + 2];

    FILE *f1 = fopen("вход", "r");
    FILE *f2 = fopen("выход", "w");

    if (f1 == 0 || f2 == 0)
    {
        fprintf(stderr, "Failed to open one or both files\n");
        return(1);
    }

    size_t nbytes;
    if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
    {
        buf[nbytes] = 0;

        if (strncmp(buf, pat, nbytes) == 0)
        {
            sprintf(str, "%.*s\n", (int)nbytes, buf);
            fwrite(str, 1, nbytes, f2);
        }
    }

    fclose(f1);
    fclose(f2);

    return(0);
}

And when I ran the script, I got:

$ bash -x chk-utf8-test.sh
+ '[' -f /etc/bashrc ']'
+ . /etc/bashrc
++ '[' -z '' ']'
++ return
+ alias 'r=fc -e -'
+ echo -n 'привет мир'
+ make utf8-test
gcc -O3 -g -std=c11 -Wall -Wextra -Werror utf8-test.c -o utf8-test
+ ./utf8-test
+ grep -q 'привет мир' $'в\321\213\321\205од'
+ echo OK
OK
$

For the record, I was using GCC 5.1.0 on Mac OS X 10.10.3.

Jonathan Leffler
  • You forgot `% 11s` in sprintf and leading space in grep. Still, `OK` is not printed. – Igor Liferenko May 22 '15 at 05:51
  • Oh, I forgot to mention that my compiler objects to the space. (What does it do for you — the message mentioned `gnu_printf`? A space flag is relevant to numeric conversions, but not to string conversions). If I wanted a space at the start, it goes before the `%`. And I did not forget the 11; I changed the `11` into `.*` and passed the correct number of bytes as an `int` argument to `printf()`. You are not using wide characters; you are using byte strings, and UTF-8 characters are of variable width, though apart from the space, yours are all 2 bytes long in UTF-8. You have to work with bytes. – Jonathan Leffler May 22 '15 at 05:58
  • The exact error message I got was: `utf8-test.c:23:20: error: ' ' flag used with ‘%s’ gnu_printf format [-Werror=format=]`. The short answer is that you can't make UTF-8 work like ASCII using the techniques you're trying to use. If you use wide character variables and functions, you're in with a chance, but not if you're using `char` and the byte-based functions. – Jonathan Leffler May 22 '15 at 05:59
  • My compiler compiles it without any error. With this space I want to align to an 11-column width with a leading space, while the string is 10 symbols - this is to check that C regards my symbols as whole units, not as separate bytes - the test which works perfectly for ASCII. Not sure how to reproduce a similar test that will work with your compiler. – Igor Liferenko May 22 '15 at 06:02
  • If it worked perfectly, why are you asking this question? It didn't work perfectly, did it? Compiling without error is meaningless unless you identify the compiler options you're using. GCC will (by design) accept the most appalling code without complaining by default. – Jonathan Leffler May 22 '15 at 06:06
  • To illustrate the idea of my test, compare the outputs of the following two commands (copy them verbatim): `echo -n A|perl -e 'printf"%2s\n",<>'` and `echo -n А|perl -e 'printf"%2s\n",<>'` Try in your compiler without the space in the format string, e.g., `printf("%6s\n","ASCII");` - will it print a leading space? This check is necessary to fully check all the points of my question in your answer. Thank you. – Igor Liferenko May 22 '15 at 06:32
  • I've been working out what the difference is between the two commands. It's horridly subtle. What looks like A in both is in fact two different characters: an ordinary Unicode U+0041 LATIN CAPITAL LETTER A in the first, but U+0410 CYRILLIC CAPITAL LETTER A in the second. When I run the scripts, then the first command prints space A, but the second prints just А. I'm not sure what this shows other than that Perl by default does not understand UTF-8. Note that the [Perl Unicode](http://perldoc.perl.org/perlunicode.html) documentation discusses 'Byte and Character Semantics'. – Jonathan Leffler May 22 '15 at 06:34
  • I did this deliberately - and this is what for leading space in grep was needed all along. Please check `printf("%6s\n","ASCII");` in your compiler. – Igor Liferenko May 22 '15 at 06:36
  • I reserve judgement on what your code shows. The use of `%11s` completely throws a spanner in the works, AFAICS. I'm certainly completely unsure of what it is supposed to demonstrate. It will take me time — probably multiple days' time, given other commitments, such as work — to find out what's going on and how to work around it. Suffice to say that `printf()` works with single-byte code sets, and tolerates UTF-8 but is unaware of what it means and still counts bytes, not characters. Working with characters requires a lot more work. I tried some wide character code and it failed. …Time… – Jonathan Leffler May 22 '15 at 06:39
  • Since your string `ASCII` is all regular Latin characters, it will be printed right-justified in a field 6 characters wide, which means there'll be a space at the start. That simply demonstrates what `printf()` does with regular text. Trying to hustle me simply isn't going to work; I need to think about what you're up to from first principles, because nothing else is going to work for me. But I do know that your code is dubious; it crashed when I ran it. I'm unable to give you a better answer now; it is bedtime in this time zone, so it'll be a good many hours before I can do anything. – Jonathan Leffler May 22 '15 at 06:45
  • Perl wants `-CSD` flags to enable the Unicode functionality. Out of the box, it has to be compatible with legacy Perl, which didn't have this functionality. – tripleee May 22 '15 at 11:34
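
The Perl comparison discussed in the comments can be reproduced in C. This sketch (illustrative only) shows that printf's field width counts bytes, so the 1-byte Latin A gets a leading space in a width-2 field while the 2-byte Cyrillic А does not:

#include <stdio.h>

int main(void)
{
  printf("[%2s]\n", "A");   /* U+0041 LATIN CAPITAL LETTER A, 1 byte in UTF-8: prints "[ A]" */
  printf("[%2s]\n", "А");   /* U+0410 CYRILLIC CAPITAL LETTER A, 2 bytes:      prints "[А]"  */
  return 0;
}
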
9

This is more of a corollary to the other answers, but I'll try to explain this from a slightly different angle.

Here is Jonathan Leffler's version of your code, with three slight changes: (1) I made explicit the actual individual bytes in the UTF-8 strings; (2) I modified the sprintf format string's width specifier to hopefully do what you are actually attempting to do; and (3), tangentially, I used perror to get a slightly more useful error message when something fails.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 40

int main(void)
{
  char buf[SIZE + 1];
  char *pat = "\320\277\321\200\320\270\320\262\320\265\321\202"
    " \320\274\320\270\321\200";  /* "привет мир" */
  char str[SIZE + 2];

  FILE *f1 = fopen("\320\262\321\205\320\276\320\264", "r");  /* "вход" */
  FILE *f2 = fopen("\320\262\321\213\321\205\320\276\320\264", "w");  /* "выход" */

  if (f1 == 0 || f2 == 0)
    {
      perror("Failed to open one or both files");  /* use perror() */
      return(1);
    }

  size_t nbytes;
  if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
    {
      buf[nbytes] = 0;

      if (strncmp(buf, pat, nbytes) == 0)
        {
          sprintf(str, "%*s\n", 1+(int)nbytes, buf);  /* nbytes+1 length specifier */
          fwrite(str, 1, 1+nbytes, f2); /* +1 here too */
        }
    }

  fclose(f1);
  fclose(f2);

  return(0);
}

The behavior of sprintf with a positive numeric width specifier is to pad with spaces from the left, so the space you tried to use is superfluous. But you have to make sure the target field is wider than the string you are printing in order for any padding to actually take place.
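
As a minimal sketch (the buffer names and sizes here are illustrative), the difference between byte-counting and character-counting widths can be seen by formatting the same text with the narrow and the wide API:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
  setlocale(LC_ALL, "");               /* pick up the UTF-8 locale from the environment */

  char nbuf[64];
  wchar_t wbuf[64];

  /* Narrow API: width 11 is compared against 19 bytes, so nothing is padded. */
  snprintf(nbuf, sizeof nbuf, "%11s", "привет мир");

  /* Wide API: width 11 is compared against 10 wide characters, so one space is added. */
  swprintf(wbuf, sizeof wbuf / sizeof wbuf[0], L"%11ls", L"привет мир");

  printf("narrow: [%s]\n", nbuf);
  printf("wide:   [%ls]\n", wbuf);     /* %ls converts the wide string back to multibyte */
  return 0;
}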

Just to make this answer self-contained, I will repeat what others have already said. A traditional char is always exactly one byte, but one character in UTF-8 is usually not exactly one byte, except when all your characters are actually ASCII. One of the attractions of UTF-8 is that legacy C code doesn't need to know anything about UTF-8 in order to continue to work, but of course, the assumption that one char is one glyph cannot hold. (As you can see, for example, the glyph п in "привет мир" maps to the two bytes -- and hence, two chars -- "\320\277".)

This is clearly less than ideal, but demonstrates that you can treat UTF-8 as "just bytes" if your code doesn't particularly care about glyph semantics. If yours does, you are better off switching to wchar_t as outlined e.g. here: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html

However, the standard wchar_t is less than ideal when the standard expectation is UTF-8. See e.g. the GNU libunistring documentation for a less intrusive alternative, and a bit of background. With that, you should be able to replace char with uint8_t and the various str* functions with u8_str* replacements and be done. The assumption that one glyph equals one byte will still need to be addressed, but that becomes a minor technicality in your example program. An adaptation is available at http://ideone.com/p0VfXq (though unfortunately the library is not available on http://ideone.com/ so it cannot be demonstrated there).
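
For a feel of what that looks like, here is a rough sketch; it assumes GNU libunistring is installed (compile with -lunistring) and uses the u8_strlen, u8_mbsnlen and u8_strcmp declarations from its <unistr.h>, as described in the library documentation:

#include <stdio.h>
#include <stdint.h>
#include <unistr.h>   /* u8_strlen, u8_mbsnlen, u8_strcmp */

int main(void)
{
  const uint8_t *pat = (const uint8_t *)"привет мир";

  size_t units = u8_strlen(pat);           /* UTF-8 units (bytes) before the NUL */
  size_t chars = u8_mbsnlen(pat, units);   /* Unicode characters in those units  */

  printf("units: %zu, characters: %zu\n", units, chars);
  printf("compares equal to itself: %s\n", u8_strcmp(pat, pat) == 0 ? "yes" : "no");
  return 0;
}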

tripleee
  • Actually, I was asking how to use normal UTF-8 in my program, i.e., how to accomplish in `C` the equivalent of `perl -CSDA -Mutf8`. Your example does not address my question, although the link that you provided is definitely on the subject. – Igor Liferenko May 23 '15 at 13:08
  • Added another brief paragraph about an alternative to `wchar_t`. – tripleee May 25 '15 at 11:49
3

The following code works as required:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

#define SIZE 10

int main(void)
{
  setlocale(LC_ALL, "");  /* select the locale (and thus UTF-8) from the environment */
  wchar_t buf[SIZE+1];
  wchar_t *pat = L"привет мир";
  wchar_t str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  fgetws(buf, SIZE+1, f1);

  if (wcsncmp(buf, pat, SIZE) == 0) {
    swprintf(str, SIZE+2, L"%11ls", buf);  /* width 11 > 10 characters: one leading space */
    fputws(str, f2);
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}
Igor Liferenko
0

Probably your test.c file is not stored in UTF-8 encoding, so the "привет мир" string literal in the program does not match the UTF-8 bytes read from the input file and the comparison fails. Change the text encoding of the source file and try again.

i486