
I'd like to get 5 instead of 10 for the following program. Does anybody know how to fix the code so that it counts the multibyte characters correctly? Thanks.

/* vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8: */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* MB_CUR_MAX */
#include <string.h>  /* memset */
#include <wchar.h>   /* mbrlen, mbstate_t */
#include <locale.h>  /* setlocale */

size_t nchars(const char *s) {
    size_t charlen, chars;
    mbstate_t mbs;

    chars = 0;
    memset(&mbs, 0, sizeof(mbs));  /* zero the conversion state */
    /* Step through one multibyte character at a time until the
     * terminator, an encoding error ((size_t)-1), or an incomplete
     * character ((size_t)-2) is reached. */
    while (
            (charlen = mbrlen(s, MB_CUR_MAX, &mbs)) != 0
            && charlen != (size_t)-1
            && charlen != (size_t)-2
            ) {
        s += charlen;
        chars++;
    }

    return (chars);
}

int main() {
    setlocale(LC_CTYPE, "en_US.utf8");
    char * text = "öçşğü";

    printf("%zu\n", nchars (text));

    return 0;
}
$ ./main.exe 
10
user1424739

1 Answer


Secondary problem: strictly speaking, you should put the mbstate_t object into its initial conversion state by zero-initializing it (for example, mbstate_t mbs = {0};), not by memset-ing it to all-bytes-zero. The standard guarantees that a zero-valued mbstate_t describes the initial conversion state, but an all-bytes-zero object representation is not guaranteed to be zero-valued, nor even valid. The mbsinit function can be used to verify that a state object does describe an initial conversion state.
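A minimal sketch of that initialization, with mbsinit() used as the check (standard C11, nothing implementation-specific assumed):

#include <assert.h>
#include <wchar.h>

int main(void) {
    mbstate_t mbs = {0};         /* zero-valued object: guaranteed initial conversion state */
    assert(mbsinit(&mbs) != 0);  /* mbsinit() returns nonzero for an initial state */
    return 0;
}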

The primary problem with your code is that it analyzes a string literal, whose representation is determined at compile time: it depends on the actual encoding of those characters in the source file, on their representation in the compiler's source character set, and on the execution character set chosen by the compiler. You cannot choose LC_CTYPE arbitrarily -- it has to be matched to the data for the multibyte conversion functions to work as intended.

C does not define a mechanism for a program to identify a locale whose LC_CTYPE corresponds to the execution character set, nor does it even require such a locale to exist. Your compiler's documentation should describe the mapping between source characters and execution characters, however, possibly in terms of a locale or a well-known encoding, and it may even describe a way for you to specify that. Your compiler's documentation may also describe a way to specify the encoding it should assume for source files.
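One thing C does let you do portably at run time is request the environment's own locale instead of hard-coding a name, by passing "" to setlocale; whether that locale then matches the string data is still an environmental question. A minimal sketch (the name that setlocale reports is implementation-specific):

#include <locale.h>
#include <stdio.h>

int main(void) {
    /* "" requests the locale configured in the environment; setlocale()
     * returns NULL if the request cannot be honored. */
    const char *loc = setlocale(LC_CTYPE, "");
    if (loc == NULL) {
        fputs("could not set LC_CTYPE from the environment\n", stderr);
        return 1;
    }
    printf("LC_CTYPE is now: %s\n", loc);
    return 0;
}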

There is an additional potential issue with Unicode: what you, a human, consider a "character" may not match the Unicode characters with which it is represented. Generally, this involves characters bearing diacritical marks such as accents. Many of the more commonly used of these have a single-character "composed" representation, but they can also be represented as a sequence of a base character plus one or more combining characters.

mbrlen() is unlikely to distinguish between base and combining characters, so even without any encoding confusion, your observed result could arise from the characters being represented in decomposed form in the source file, or from the compiler transforming them into that form.
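To make that concrete, here is a minimal sketch, assuming a UTF-8 locale named "en_US.utf8" is available (as in the question); count_codepoints() is just the question's counting loop under another name. Both string literals render as "ö", but the first is the precomposed U+00F6 and the second is 'o' followed by the combining diaeresis U+0308, so the counts come out as 1 and 2:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

static size_t count_codepoints(const char *s) {
    mbstate_t mbs = {0};
    size_t n = 0, len;
    while ((len = mbrlen(s, MB_CUR_MAX, &mbs)) != 0
            && len != (size_t)-1 && len != (size_t)-2) {
        s += len;
        n++;
    }
    return n;
}

int main(void) {
    setlocale(LC_CTYPE, "en_US.utf8");
    printf("%zu\n", count_codepoints("\xc3\xb6"));   /* NFC "ö": prints 1 */
    printf("%zu\n", count_codepoints("o\xcc\x88"));  /* NFD "ö": prints 2 */
    return 0;
}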

The bottom line is that your program depends on environmental and implementation characteristics that the standard does not specify, so it may behave differently with different implementations -- as indeed seems to be the observation. Your particular result could arise, for example, from the source file being encoded in UTF-8 but the compiler assuming a single-byte encoding such as ISO-8859-1, while nonetheless using UTF-8 as its execution character set.

Your approach might work without changes if you ensure that the compiler interprets the source file according to that file's actual encoding, and that it uses UTF-8 as its execution character set. Alternatively, in C11 or later you can ensure that the runtime encoding of that specific string is UTF-8 by using a UTF-8 literal, like so:

char * text = u8"öçşğü";

That takes care of only the execution-side encoding, however. You still need to match the source file encoding to the actual encoding expected by the compiler, and you can still be affected by differences between pre-composed and decomposed characters.
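If you cannot control the source file's encoding, one way to sidestep that question entirely is to combine the u8 prefix with universal character names, which keep the source file pure ASCII. A sketch, assuming C11 and the same "en_US.utf8" locale; count_codepoints() is again just the question's counting loop renamed:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

static size_t count_codepoints(const char *s) {
    mbstate_t mbs = {0};
    size_t n = 0, len;
    while ((len = mbrlen(s, MB_CUR_MAX, &mbs)) != 0
            && len != (size_t)-1 && len != (size_t)-2) {
        s += len;
        n++;
    }
    return n;
}

int main(void) {
    setlocale(LC_CTYPE, "en_US.utf8");
    /* Universal character names keep the source file pure ASCII, and the
     * u8 prefix guarantees the literal is encoded as UTF-8 at run time. */
    const char *text = u8"\u00F6\u00E7\u015F\u011F\u00FC";  /* öçşğü */
    printf("%zu\n", count_codepoints(text));  /* prints 5 */
    return 0;
}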

John Bollinger
  • All good information, but--in UTF-8--don't you also have to worry about the normalization form? If the string is UTF-8 in NFC, then it's five precombined characters. If it's NFD, then it's five base characters plus five combining characters, for a total of 10. – Adrian McCarthy Feb 08 '19 at 17:03
  • That's a fair point, @AdrianMcCarthy, but outside the scope of encoding. UTF-8 will happily encode Unicode character sequences in any normalization form or none, and I expect `mbrlen` not to distinguish between base characters and combining characters. The representation in the source should therefore be normalized to form NFC. – John Bollinger Feb 08 '19 at 17:08
  • Yep. I was just going off the first line of the post that wants "5 instead of 10." That's going to happen only if the text is in NFC (or if mbrlen works in terms of grapheme clusters, which I doubt). It's possible that different compilers are handling the identical source file differently: Some might take the source as-is, and others might normalize the source to one normal form or the other. That could explain some of the inconsistent results. – Adrian McCarthy Feb 08 '19 at 17:13
  • Indeed it could, @AdrianMcCarthy. I have updated the answer to discuss some of those points. – John Bollinger Feb 08 '19 at 17:19