
In UTF-8, I count characters (not bytes) using this function:

int schars(const char *s)
{
    int i = 0;

    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Does this work on implementations where plain char is unsigned char?

David Ranieri
  • Fun, barely related, facts: plain `char` *is* `unsigned char` under gcc by default. Under MSVC, it's `signed char`. By the standard, `char`, `unsigned char`, and `signed char` are three distinct types. – Corbin Jan 09 '13 at 10:21
  • 1
    @Corbin I think that plain `char` is `signed char` on gcc – David Ranieri Jan 09 '13 at 10:24
  • Could have sworn that gcc has it as signed by default, but looks like you're right. Whoops :). – Corbin Jan 09 '13 at 10:27
  • This is because the default `char` type has implementation-defined signedness, unlike all the other integer types like `int`, which are all guaranteed to be signed by default. – Lundin Jan 09 '13 at 10:31
  • @Corbin, David: for GCC the "default if you don't specify `-fsigned-char` or `-funsigned-char`" depends on the target. When you come to configure GCC for a new target it might be that there's a "default for the default" if you don't configure anything. I don't know. – Steve Jessop Jan 09 '13 at 10:35
  • @SteveJessop wow, I did not know this, that means that I can force gcc to have a particular type? – David Ranieri Jan 09 '13 at 10:38
  • @DavidRF: yes, although it also varies by target what happens if you make a choice that conflicts with the target's ABI. In principle you might see strange things happen when you call a function in a dll that takes a `char` parameter, since the caller and callee don't agree what the possible values of `char` are. In practice I'd guess that most targets will in effect just convert/reinterpret the value in the expected way. – Steve Jessop Jan 09 '13 at 10:46
  • @SteveJessop Ah, good to know. For some reason had assumed that it was purely the compiler's choice. – Corbin Jan 09 '13 at 11:11

3 Answers


It works as well when char is unsigned as it does when it's signed.

In both a signed two's complement representation and in an unsigned representation, the top two bits (bits 7 and 6) of a UTF-8 code unit are 10 if and only if the code unit is not the first code unit of a code point. So you're counting 1 for the first code unit of each code point.

int is not guaranteed to be a large enough type to contain the number of characters in every string, but I assume you don't care ;-)
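
If you did care, `size_t` is guaranteed to hold the size of any object, so it can hold any string's code point count. A sketch of the same function with a wider counter (`schars_z` is my name, not from the question):

```c
#include <stddef.h>

/* counts UTF-8 code points, like schars in the question, but overflow-safe */
size_t schars_z(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xc0) != 0x80)  /* not a continuation byte */
            n++;
    return n;
}
```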

"Character" is potentially an ambiguous term. This code counts Unicode code points, which is not the same thing as displayable characters ("graphemes"). Sometimes multiple code points represent a single grapheme, for example when combining marks are used for accents. About the only practical use for knowing how many code points there are in a Unicode string, is to calculate how many bytes it will occupy when encoded as UTF-32. If you're careful, you can ensure that the only code that needs to process "characters" is the font engine, plus some complex operations like Unicode normalization and character encodings.

Steve Jessop

It should.

You are only using bitwise operators, and those behave the same irrespective of whether the underlying data type is signed or unsigned. The only exception may be the != operator, but you could replace this with an & and then wrap the whole thing in a !, à la:

!((*s & 0xc0) & 0x80)

and then you have solely bitwise operators.

You can verify that the characters are promoted to integers by checking section 3.3.10 of the ANSI C Standard which states that "Each of the operands [of the bitwise AND] shall have integral type."

EDIT

I amend my answer. Bitwise operations are not the same on signed as on unsigned, as per 3.3 of the ANSI C Standard:

Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.

In fact, performing bitwise operations on signed integers is listed as a possible security hole here.

In the Visual Studio compiler signed and unsigned are treated the same (see here).

As this SO question discusses, it is better to use unsigned char to do byte-wise reads of memory and manipulations of memory.

Richard
  • Thanks Richard, that's what I thought – David Ranieri Jan 09 '13 at 10:28
  • @DavidRF, I've continued to think about the problem and now I am less sure of my answer. If you were reading into `unsigned char` everything would be good. As it is, I am not sure of what happens when you perform `&` on a `signed char`. I'm trying to figure out how to do a safe conversion. – Richard Jan 09 '13 at 11:03
  • yes, "Ramón" returns 4 instead of 5 using `!((*s & 0xc0) & 0x80)`, thanks for the advice – David Ranieri Jan 09 '13 at 11:14
  • 1
    @DavidRF, I tried rephrasing the question [here](http://stackoverflow.com/questions/14233716/bitwise-and-on-signed-chars) - hopefully some of the answers there will put this to rest. – Richard Jan 09 '13 at 13:12

Yes, it will.

*s will be promoted to int before the computations take place. So, your code is equivalent to:

if (((int) *s & 0xC0) != 0x80) i++;

And the above will work even if char is unsigned.

Frédéric Hamidi