
In UTF-8, I count characters (not bytes) using this function:

int schars(const char *s)
{
    int i = 0;

    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Does this work on implementations where plain char is unsigned char?

David Ranieri
  • Fun, barely related, facts: plain `char` *is* `unsigned char` under gcc by default. Under MSVC, it's `signed char`. By the standard, `char`, `unsigned char`, and `signed char` are three distinct types. – Corbin Jan 09 '13 at 10:21
  • 1
    @Corbin I think that plain `char` is `signed char` on gcc – David Ranieri Jan 09 '13 at 10:24
  • Could have sworn that gcc has it as signed by default, but looks like you're right. Whoops :). – Corbin Jan 09 '13 at 10:27
  • This is because the default `char` type has implementation-defined signedness, unlike all the other integer types like `int`, which are all guaranteed to be signed by default. – Lundin Jan 09 '13 at 10:31
  • @Corbin, David: for GCC the "default if you don't specify `-fsigned-char` or `-funsigned-char`" depends on the target. When you come to configure GCC for a new target it might be that there's a "default for the default" if you don't configure anything. I don't know. – Steve Jessop Jan 09 '13 at 10:35
  • @SteveJessop wow, I did not know this, that means that I can force gcc to have a particular type? – David Ranieri Jan 09 '13 at 10:38
  • @DavidRF: yes, although it also varies by target what happens if you make a choice that conflicts with the target's ABI. In principle you might see strange things happen when you call a function in a dll that takes a `char` parameter, since the caller and callee don't agree what the possible values of `char` are. In practice I'd guess that most targets will in effect just convert/reinterpret the value in the expected way. – Steve Jessop Jan 09 '13 at 10:46
  • @SteveJessop Ah, good to know. For some reason had assumed that it was purely the compiler's choice. – Corbin Jan 09 '13 at 11:11

3 Answers


It works as well when char is unsigned as it does when it's signed.

In both a signed two's complement representation and in an unsigned representation, the top two bits (bits 7 and 6) of a UTF-8 code unit are 10 if and only if the code unit is not the first code unit of a code point. So you're counting 1 for the first code unit of each code point.

int is not guaranteed to be a large enough type to contain the number of characters in every string, but I assume you don't care ;-)
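
If you did care, `size_t` is guaranteed to hold the size of any object, so it can hold any string's code point count. A sketch of the same function with a wider counter (`schars_z` is my name, not from the question):

```c
#include <stddef.h>

/* counts UTF-8 code points, like schars in the question, but overflow-safe */
size_t schars_z(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xc0) != 0x80)  /* not a continuation byte */
            n++;
    return n;
}
```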

"Character" is potentially an ambiguous term. This code counts Unicode code points, which is not the same thing as displayable characters ("graphemes"). Sometimes multiple code points represent a single grapheme, for example when combining marks are used for accents. About the only practical use for knowing how many code points there are in a Unicode string, is to calculate how many bytes it will occupy when encoded as UTF-32. If you're careful, you can ensure that the only code that needs to process "characters" is the font engine, plus some complex operations like Unicode normalization and character encodings.

Steve Jessop

It should.

You are only using bitwise operators, and those behave the same irrespective of whether the underlying data type is signed or unsigned. The only exception may be the != operator, but you could replace this with an & and then wrap the whole thing in a !, à la:

!((*s & 0xc0) & 0x80)

and then you have solely bitwise operators.

You can verify that the characters are promoted to integers by checking section 3.3.10 of the ANSI C Standard which states that "Each of the operands [of the bitwise AND] shall have integral type."

EDIT

I amend my answer. Bitwise operations are not the same on signed as on unsigned, as per 3.3 of the ANSI C Standard:

Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.

In fact, performing bitwise operations on signed integers is listed as a possible security hole here.

In the Visual Studio compiler signed and unsigned are treated the same (see here).

As this SO question discusses, it is better to use unsigned char to do byte-wise reads of memory and manipulations of memory.

Richard
  • Thanks Richard, that's what I thought – David Ranieri Jan 09 '13 at 10:28
  • @DavidRF, I've continued to think about the problem and now I am less sure of my answer. If you were reading into `unsigned char` everything would be good. As it is, I am not sure of what happens when you perform `&` on a `signed char`. I'm trying to figure out how to do a safe conversion. – Richard Jan 09 '13 at 11:03
  • yes, "Ramón" returns 4 instead of 5 using `!((*s & 0xc0) & 0x80)`, thanks for the advice – David Ranieri Jan 09 '13 at 11:14
  • 1
    @DavidRF, I tried rephrasing the question [here](http://stackoverflow.com/questions/14233716/bitwise-and-on-signed-chars) - hopefully some of the answers there will put this to rest. – Richard Jan 09 '13 at 13:12

Yes, it will.

*s will be promoted to int before the computations take place. So, your code is equivalent to:

if (((int) *s & 0xC0) != 0x80) i++;

And the above will work even if char is unsigned.

Frédéric Hamidi