
I am parsing a file that contains characters such as æ, ø and å. Assume I have stored a line of the text file as follows:

#define MAXLINESIZE 1024
char* buffer = malloc(MAXLINESIZE);
...
fgets(buffer, MAXLINESIZE, handle);
...

Now I want to count the number of characters on that line. If I try the following:

char* p = buffer;
int count = 0;
while (*p != '\n') {
    if (isgraph(*p)) {
        count++;
    }
    p++;
}

this ignores any occurrence of æ, ø or å,

i.e. counting "aåeæioøu" returns 5, not 8.

Do I need to read the file in a different way? Should I be using an int* instead of a char*?

Will Vousden
beoliver

3 Answers


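Let's assume the file is encoded in UTF-8.

You need to understand how UTF-8 works: each character takes one to four bytes, and the first byte of a character tells you how many bytes follow.

Here's a small function that should do what you want: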

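int nbChars(const char *str) {
    int len = 0;
    int i = 0;
    int charSize = 0; // Bytes still expected for the current character

    if (!str)
        return -1;
    while (str[i])
    {
        // Work on an unsigned value so the bit tests do not depend on char signedness
        unsigned char c = (unsigned char)str[i];

        if (charSize == 0)
        {
            ++len;
            if ((c & 0x80) == 0x00)      // 0xxxxxxx: ASCII char
                charSize = 1;
            else if ((c & 0xE0) == 0xC0) // 110xxxxx: start of a 2-byte char
                charSize = 2;
            else if ((c & 0xF0) == 0xE0) // 1110xxxx: start of a 3-byte char
                charSize = 3;
            else if ((c & 0xF8) == 0xF0) // 11110xxx: start of a 4-byte char
                charSize = 4;
            else
                return -1; // stray continuation byte or invalid lead byte
        }
        else if ((c & 0xC0) != 0x80) // continuation bytes must look like 10xxxxxx
            return -1;
        --charSize;
        ++i;
    }
    if (charSize != 0)
        return -1; // the string ended in the middle of a character
    return len;
}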

It returns the number of characters (code points), or -1 if the string is not valid UTF-8.

(By "not valid" I mean that the byte format is wrong. I don't check whether each character actually exists.)

EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a base letter followed by a combining mark is counted as two characters.
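To illustrate the limitation noted in the EDIT, here is a minimal sketch that counts code points by skipping continuation bytes. The function name `count_code_points` is made up for this example, and the sketch assumes the input is already valid UTF-8; it does no validation.

```c
#include <stddef.h>

/* Count code points in a UTF-8 string by counting every byte that is NOT a
   continuation byte (10xxxxxx). Assumes the input is already valid UTF-8. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With this, the precomposed å (`"\xC3\xA5"`, U+00E5) counts as one code point, while the decomposed form (`"a\xCC\x8A"`, an a followed by U+030A COMBINING RING ABOVE) counts as two, even though both display as a single character. Getting one "character" for both forms requires normalization or grapheme segmentation, which is where a Unicode library comes in.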

4rzael
    This is a good illustration of why you really need to use a library. The above code will not necessarily count the number of characters properly, because some characters can be encoded in more than one way, e.g. å might be encoded as the single precomposed character U+00E5 (`C3 A5` in UTF-8), or as an a followed by the combining ring above U+030A (`61 CC 8A`). The two forms are called composed and decomposed Unicode respectively. – JeremyP Sep 11 '15 at 13:57

The C standard I/O functions such as fgets read bytes, not characters. Your file probably contains multibyte characters, encoded with UTF-8 or some other encoding. You'll need a library for interpreting such files.

It is also possible that your file contains Latin-1 text, in which case each character is a single byte. Even then, you cannot rely on isgraph to recognize æ, ø or å unless you have the proper locale set.
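For the single-byte (Latin-1) case, a hedged sketch of the counting loop might look like the following. The function name `count_graph_bytes` is invented for this example. Two details matter: the byte is converted to `unsigned char` before it reaches `isgraph` (passing a negative `char` value is undefined behavior), and the loop also stops at the terminating NUL in case the line has no newline.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes that isgraph() accepts, up to the newline or the end of
   the string. Whether bytes >= 0x80 are accepted depends on the current
   LC_CTYPE locale. */
size_t count_graph_bytes(const char *line)
{
    size_t count = 0;
    const unsigned char *p = (const unsigned char *)line;

    for (; *p != '\0' && *p != '\n'; ++p) {
        if (isgraph(*p))
            ++count;
    }
    return count;
}
```

For æ, ø and å to be counted, the program would first have to select a Latin-1 locale with `setlocale(LC_CTYPE, ...)` from `<locale.h>`. The locale name is system-specific (something like `"en_US.ISO-8859-1"` is an assumption, not a guarantee), so check the return value of `setlocale`.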

Bottom line: find out which encoding your file uses, then read it accordingly. In any case, a plain C char* is just a sequence of bytes and carries no information about its encoding.

lhf
    See http://stackoverflow.com/questions/4588897/handling-multibyte-non-ascii-characters-in-c. – lhf Sep 11 '15 at 12:51

You need to find out which encoding is used for your characters. It is very probably UTF-8 (and you should use UTF-8 everywhere); read Joel's blog post on Unicode. If your encoding is not UTF-8, you should convert it to UTF-8, e.g. using libiconv.
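As a rough illustration of what such a conversion does, here is a hand-rolled sketch for the one special case of ISO-8859-1 (Latin-1) input. This is not libiconv and not a substitute for it; the function name `latin1_to_utf8` is invented for this example. It works only because every Latin-1 byte value equals its Unicode code point, so bytes below 0x80 are copied and all others become a two-byte UTF-8 sequence.

```c
#include <stddef.h>

/* Convert n bytes of ISO-8859-1 text to UTF-8.
   dst must have room for at least 2*n + 1 bytes.
   Returns the number of bytes written, not counting the terminating NUL. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < n; ++i) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                  /* ASCII: unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte 110xxxxx */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    dst[out] = '\0';
    return out;
}
```

For example, the Latin-1 byte `E5` (å) becomes the UTF-8 bytes `C3 A5`. For any other source encoding, a real conversion library is the safer route.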

Then you need a C library for UTF-8. There are many of them (but none is standardized in the C11 language yet). I recommend libunistring or glib (from GTK), but see also this.

Your code will change, since a UTF-8 character can take one to four 8-bit bytes (older descriptions of UTF-8 mention up to 6 bytes; see the Unicode standard for details). You won't test whether a single byte (i.e. a plain C char) is a letter, but whether a byte and the few bytes after it (given by a pointer, i.e. a char* or better a uint8_t*) encode a letter (including Cyrillic letters, etc.).
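To make the "a byte and the few bytes after it" idea concrete, here is a minimal sketch of decoding one character from a `uint8_t*`. The name `utf8_decode` and its signature are made up for this example; a library such as libunistring or glib provides its own, more thorough equivalents. The sketch checks only the lead-byte and continuation-byte patterns; it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the UTF-8 sequence starting at p, where n bytes are available.
   On success, store the code point in *cp and return the sequence length
   in bytes (1 to 4). Return 0 if the sequence is malformed or truncated. */
size_t utf8_decode(const uint8_t *p, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c;

    if (n == 0)
        return 0;
    if (p[0] < 0x80)                { len = 1; c = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else
        return 0;                   /* continuation byte or invalid lead byte */

    if (len > n)
        return 0;                   /* sequence is cut off */
    for (size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;               /* not a 10xxxxxx continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return len;
}
```

Walking a line then means calling this repeatedly and advancing the pointer by the returned length. Deciding whether the resulting code point is a letter is a separate question that needs Unicode character data, which is again what the libraries above supply.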

Not every sequence of bytes is a valid UTF-8 representation, and you might want to validate a line (or a null-terminated C string) before analyzing it.

Basile Starynkevitch
    The maximum number of bytes needed to represent a Unicode code point in UTF-8 is 4, regardless of what older documents suggest. The last Unicode value is U+10FFFF. Once upon a decade or more ago, the upper bound was not defined. – Jonathan Leffler Sep 11 '15 at 13:47