Why do a bitwise-and of a character with 0xff?

Question

I am reading some code that implements a simple parser. A function named scan breaks up a line into tokens. scan has a static variable bp that is assigned the line to be tokenized. Following the assignment, the whitespace is skipped over. See below. What I don't understand is why the code does a bitwise-and of the character that bp points to with 0xff, i.e., what is the purpose of * bp & 0xff? How is this:

while (isspace(* bp & 0xff))
    ++ bp;

different from this:

while (isspace(* bp))
    ++ bp;

Here is the scan function:

static enum tokens scan (const char * buf)
                    /* return token = next input symbol */
{   static const char * bp;

    while (isspace(* bp & 0xff))
        ++ bp;

        ..
}

For `isspace`, the behavior is undefined if the value of `*bp` is not representable as `unsigned char` and is not equal to `EOF`, so perhaps this is a fancy cast, instead of doing `(unsigned char) *bp`. Is `bp` a `char*`? — Ted Lyngmo, May 24 '21 at 19:10
Look up the concept of bitmasks. Basically, 0xff translates to `11111111` in binary, or all `1`'s for a single byte. This is useful if you only want a single byte of data, for example, instead of the entire value (which could be multiple bytes). For example, an `int` may be 4 bytes, so if you only want the lowermost 1 byte you can simply do `int_variable & 0xff` to get the value. — h0r53, May 24 '21 at 19:10
In this case, you are effectively checking the lowermost byte of `bp`, discarding possible other bytes (by doing `& 0xff`) then seeing if the result matches a whitespace character. — h0r53, May 24 '21 at 19:12
I now see that `bp` is a `char*` - the formatting threw me off a bit. — Ted Lyngmo, May 24 '21 at 19:17
Excellent! Thank you. I am still a bit puzzled, though. "bp" is declared this way: static const char * bp; Doesn't that mean bp always points to a (1-byte) character? In which case, there are no other bytes to discard or be concerned with when checking for whitespace, right? — Roger Costello, May 24 '21 at 19:26
@RogerCostello yes, but I think the issue has less to do with the type of `bp` and more to do with the type of the argument of `isspace`, which is an `int`, which may be multiple bytes. — h0r53, May 24 '21 at 19:28
The [C++ version](https://en.cppreference.com/w/cpp/string/byte/isspace) of the online documentation for `isspace` etc. that I use says "_To use these functions safely with plain `char`s (or `signed char`s), the argument should first be converted to `unsigned char`_" - so if the `ctype.h` function makes the assumption that it will get `[-1, 255]` and has a simple lookup table, like `static bool isspace[257] = { false, false ... };` then if you send in a negative value (except EOF), it could catch fire. — Ted Lyngmo, May 24 '21 at 19:35
Could it be a way to be portable to systems with a byte size different from 8 bits? — nielsen, May 24 '21 at 19:41
@nielsen It's really a way to make it representable as an `unsigned char` since implementations are allowed to make the assumption that it'll get EOF or something representable as an `unsigned char`. I think KamilCuk's answer explains it. — Ted Lyngmo, May 24 '21 at 19:43
Apart from the missing include files, and variable definitions; the missing part is **default integer promotion**. [and BTW: @TedLyngmo : please don't post references to C++ documentation for a C issue. The languages are different. try `sizeof ('a')` , for instance] Oh, and there is the *signedness* of char. — wildplasser, May 24 '21 at 20:10
@wildplasser I only brought the C++ version of the documentation I use into this because that mentions the cast to make the call safe while the C documentation on the same wiki does not. I know that `'a'` is an `int` in C and not a `char` as in C++ but that wasn't the point. — Ted Lyngmo, May 24 '21 at 20:17
@wildplasser If it explains how wheels work but the Volkswagen manual does not, I might. The reason for the cast to `unsigned char` is the same in C and C++. — Ted Lyngmo, May 24 '21 at 20:21

Vlad from Moscow · Accepted Answer · 2021-05-24T19:56:59.007

From the C Standard (7.4 Character handling <ctype.h>)

1 The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

In this call

isspace(* bp)

the argument expression *bp having the type char is converted to the type int due to the integer promotions.

If the type char behaves as the type signed char and the value of the expression *bp is negative, then the value of the promoted expression of the type int will also be negative and cannot be represented as a value of the type unsigned char.

This results in undefined behavior.

In this call

isspace(* bp & 0xff)

due to the bitwise operator & the result value of the expression * bp & 0xff of the type int can be represented as a value of the type unsigned char.

So it is a trick used instead of writing clearer code such as

isspace( ( unsigned char )*bp )

The function isspace is usually implemented in such a way that it uses its argument of the type int as an index into a table with 256 values (from 0 to 255). If the argument of the type int has a value that is greater than the maximum value 255 or a negative value (and is not equal to the value of the macro EOF) then the behavior of the function is undefined.
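
To make the equivalence concrete, here is a minimal sketch that compares the masking trick with the explicit cast for every possible char value. The helper names arg_by_mask, arg_by_cast and mask_and_cast_agree are made up for illustration, and the sketch assumes an 8-bit char and a two's-complement representation; under those assumptions the two forms should always agree.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: the masking trick from the question. */
static int arg_by_mask(char c)
{
    return c & 0xff;            /* c is promoted to int before the & */
}

/* Hypothetical helper: the explicit cast. */
static int arg_by_cast(char c)
{
    int arg = (unsigned char)c;
    assert(arg >= 0 && arg <= UCHAR_MAX);   /* always holds after the cast */
    return arg;
}

/* Returns 1 if both helpers agree for every possible char value, else 0. */
static int mask_and_cast_agree(void)
{
    for (int v = CHAR_MIN; v <= CHAR_MAX; ++v) {
        if (arg_by_mask((char)v) != arg_by_cast((char)v))
            return 0;
    }
    return 1;
}
```

Calling mask_and_cast_agree() is expected to return 1 on such a platform, which is why the cast is the more readable spelling of the same idea.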

KamilCuk · Answer 2 · 2021-05-24T19:44:52.497

From cppreference isspace(): The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.

When *bp is negative, for example it's -42, then it is not representable as unsigned char, because it's negative and unsigned char, well, must be positive or zero.

On two's-complement systems, values are sign-extended when converted to a wider type, so a negative value gets its leftmost bits set. When you then AND the wider value with 0xff, those leftmost bits are cleared, and you end up with a nonnegative value less than or equal to 0xff, that is, one representable as unsigned char.

Note that the operands of & undergo implicit promotions, so *bp is converted to int before isspace is even called. For example, assume *bp = -42 on a sane platform where char is signed and 8 bits wide and int has 32 bits; then:

*bp & 0xff               # expand *bp = -42
(char)-42 & 0xff         # apply promotion
(int)-42 & 0xff          # lets convert to hex assuming twos-complement
(int)0xffffffd6 & 0xff   # do & operation
(int)0xd6                # lets convert to decimal
214                      # representable as unsigned char, all fine

Without the & 0xff the negative value would result in undefined behavior.
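
The derivation above can be checked with a few lines of C. This is only a sketch: masked_value is a hypothetical helper, it takes a signed char explicitly so that the example behaves the same where plain char is unsigned, and it assumes two's complement.

```c
#include <assert.h>

/* Hypothetical helper mirroring the steps of the derivation. */
static int masked_value(signed char c)
{
    int promoted = c;               /* integer promotion keeps the value: -42 stays -42 */
    int masked = promoted & 0xff;   /* clears every bit above the low eight */
    assert(masked >= 0 && masked <= 0xff);
    return masked;
}
```

With these assumptions, masked_value(-42) yields 214 and nonnegative inputs such as 65 pass through unchanged.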

I would recommend isspace((unsigned char)*bp) instead.

Basically the simplest isspace implementation looks like just:

static const char bigarray[257] = { 0,0,0,0,0,...1,0,1,0,... };
// note: EOF is -1
#define isspace(x)  (bigarray[(x) + 1])

and in that case you can't pass, for example, -42, because bigarray[-41] is simply invalid.
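
For illustration only, a runnable variant of that table idea might look like the sketch below. The names space_table, init_space_table and my_isspace are hypothetical, the table covers only the six whitespace characters of the "C" locale, and the layout assumes EOF is -1 as in the snippet above. The assert marks the range check that a real library is allowed to omit.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF, size_t */

/* Hypothetical table: slot 0 is for EOF, slots 1..UCHAR_MAX+1 for the
 * unsigned char values 0..UCHAR_MAX. Assumes EOF == -1. */
static unsigned char space_table[UCHAR_MAX + 2];

static void init_space_table(void)
{
    static const char spaces[] = " \t\n\v\f\r";
    for (size_t i = 0; spaces[i] != '\0'; ++i)
        space_table[(unsigned char)spaces[i] + 1] = 1;
}

static int my_isspace(int c)
{
    /* A real implementation may skip this check entirely, which is why
     * an out-of-range argument is undefined behavior. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return space_table[c + 1];
}
```

After init_space_table(), my_isspace(-42 & 0xff) indexes slot 215 and is fine, whereas my_isspace(-42) would trip the assert here and would read outside the table without it.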

score 1 · Answer 3 · answered May 24 '21 at 19:23

Your question:

How is this:

while (isspace(* bp & 0xff))
    ++ bp;

different from this:

while (isspace(* bp))
    ++ bp;

The difference is that in the first example you always pass the lowermost byte at bp to isspace, because of the bitwise AND with a full bitmask (0b11111111 or 0xff). It's possible that the argument to isspace has a type that is larger than 1 byte. For example, isspace is declared as isspace(int c), so the argument here is an int, which may be multiple bytes depending on your system.

In short, it's a sanity check to ensure that isspace is only comparing a single byte from your bp variable.

answered May 24 '21 at 19:23

h0r53

...but `bp` is a `char*`, so `*bp` is a `char`, ie, it occupies a single byte. – pmg May 24 '21 at 19:26
That's correct, but `isspace` actually takes an int, not a byte, so some casting occurs. It's basically a sanity check to ensure only a single byte from `bp` is used for the argument to `isspace` – h0r53 May 24 '21 at 19:27
hm... when `*bp` is lower than 0, it is sign-extended to `int`, so it has more set bits then, right? – KamilCuk May 24 '21 at 19:27
I see, you mean 'a single byte from `*bp` after it is (automatically) converted to `int`' – pmg May 24 '21 at 19:28
Yes, it does seem "pointless" and I'm sure in most cases it would be, but there is likely an edge case where this type of explicit comparison is useful (for example, implicit typecasts, sign-extension, unicode, etc). – h0r53 May 24 '21 at 19:29
@h0r53 : what's taken from `bp` is already, necessarily, a single byte, since it's declared as `const char *`, which makes this sanity check technically useless here. Also, the reason `isspace(c)` accepts an `int` is that `fgetc()` itself returns an _int_, so that it can return special markers such as EOF in addition to regular characters. Trimming those with 0xff would yield bogus characters. – Obsidian May 24 '21 at 19:32
both are wrong if `*bp` does not reference `unsigned char` – 0___________ May 24 '21 at 19:36

0___________ · Answer 4 · 2021-05-24T19:36:44.623

while (isspace(* bp & 0xff))
    ++ bp;

and

while (isspace(* bp))
    ++ bp;

Strictly speaking, both are incorrect if bp does not reference unsigned char.

In this case it should be:

while (isspace((unsigned char)(*bp & 0xff)))
    ++ bp;

or better

while (isspace(*bp == EOF ? EOF : (unsigned char)(*bp & 0xff)))
    ++ bp;

The behavior of isspace is undefined if its argument is neither EOF nor representable as unsigned char.

If bp references char, it has to be:

while (isspace((unsigned char)(*bp)))
    ++bp;
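
As a sketch of how that last form could be packaged, the loop can be wrapped in a small helper. The name skip_spaces is hypothetical and not part of the original parser; the helper assumes a NUL-terminated string, for which the loop stops at the terminator because isspace(0) is false.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first non-whitespace
 * character of the NUL-terminated string p. */
static const char *skip_spaces(const char *p)
{
    assert(p != NULL);
    /* The cast keeps the argument within 0..UCHAR_MAX even when plain
     * char is signed and *p is negative. */
    while (isspace((unsigned char)*p))
        ++p;
    return p;
}
```

For example, skip_spaces("  \t x") returns a pointer to the 'x', and a byte such as 0xd6 is passed to isspace as 214 rather than as a negative int.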

score 1 · Answer 5 · answered May 24 '21 at 19:46

In C, char can be signed or unsigned: https://en.wikipedia.org/wiki/C_data_types

When passed to isspace, *bp is promoted to int. If char is signed and the high bit is set, the value is sign-extended and becomes a negative int. That value may then be neither representable as unsigned char nor equal to EOF, as the isspace function requires: https://linux.die.net/man/3/isspace

See http://cpp.sh/9mp2i for how the bitwise AND changes the value that isspace sees.

score 0 · Answer 6 · answered Jun 18 '21 at 05:50

If we assume that char always has 8 bits,
then the bitwise AND with 0xff here looks confusing.

But what if char is not always 8 bits wide?
Then 0xff may have another meaning, right?

In fact, char is not always 8 bits wide, and the C99 standard gives the details: it does not define char as having 8 bits.

The following is how the C99 standard describes the size of the char type.

6.5.3.4 The sizeof operator: When applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1. When applied to an operand that has array type, the result is the total number of bytes in the array. When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.

6.2.5 Types: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be positive. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

For example, the TMS320C28x DSP from Texas Instruments has a 16-bit char.
Its compiler documentation specifies CHAR_BIT as 16 on page 99.

This appears to be a modern processor (currently being sold), with compilers supporting C99 and C++03.
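
A rough way to see whether this matters on a given platform is to compare 0xff with the limits from <limits.h>. The helper names below are made up for illustration. On a platform with a 16-bit char, & 0xff would discard the upper 8 bits of the character, while a cast to unsigned char would keep them.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: 1 if the mask 0xff covers every bit of an
 * unsigned char on this platform, which is the case when CHAR_BIT == 8. */
static int mask_covers_whole_char(void)
{
    return UCHAR_MAX == 0xff;
}

/* Hypothetical helper: number of high bits of an unsigned char that
 * `& 0xff` would discard on this platform. */
static int bits_dropped_by_mask(void)
{
    assert(CHAR_BIT >= 8);      /* the C standard guarantees at least 8 */
    return CHAR_BIT - 8;
}
```

On a typical desktop system bits_dropped_by_mask() is 0; with CHAR_BIT equal to 16 it would be 8.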
