
The code I am handling has a lot of casts from uint8 to char, and the C library functions are then called on the results. I was trying to understand why the writer would prefer uint8 over char. For example:

uint8 *my_string = "XYZ";
strlen((char*)my_string);

What happens to the \0? Is it added when I cast?

What happens when I cast the other way around?

Is this a legit way to work, and why would anybody prefer working with uint8 over char?


1 Answer


The casts char <=> uint8 are fine. It is always allowed to access any defined memory as unsigned characters, including string literals, and of course a pointer that points into a string literal can be cast back to char *.
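
For illustration, here is a minimal sketch of the first point (the names are mine, not from the question): any object's bytes may be inspected through an unsigned char pointer.

#include <stdio.h>

int main(void)
{
    int n = 42;
    const unsigned char *bytes = (const unsigned char *)&n;  /* always allowed */

    for (size_t i = 0; i < sizeof n; i++)     /* dump the object representation */
        printf("%02x ", (unsigned)bytes[i]);
    putchar('\n');
    return 0;
}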

In

uint8 *my_string = "XYZ";

"XYZ" is an anonymous array of 4 chars - including the terminating zero. This decays into a pointer to the first character. This is then implicitly converted to uint8 * - strictly speaking, it should have an explicit cast though.


The problem with the type char is that the standard leaves it up to the implementation to define whether it is signed or unsigned. If the code does lots of arithmetic with the characters/bytes, it might be beneficial to have them unsigned by default.

A particularly notorious example is <ctype.h> with its is* character-classification functions - isspace, isalpha and the like. They require their argument as an unsigned char value (converted to int)! A piece of code that does the equivalent of char c = something(); if (isspace(c)) { ... } is not portable, and a compiler cannot even warn about it. If the char type is signed on the platform (the default on x86!) and the character isn't ASCII (or, more precisely, not a member of the basic execution character set), then the behaviour is undefined - it would at least abort on MSVC debug builds, but unfortunately just causes silent undefined behaviour (an out-of-bounds array access) on glibc.
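
The portable idiom is to convert the value to unsigned char before handing it to the classification function; a sketch (the wrapper name is hypothetical):

#include <ctype.h>

/* hypothetical wrapper: safe even where plain char is signed */
int is_space_safe(char c)
{
    return isspace((unsigned char)c);  /* value is now in the 0..UCHAR_MAX range */
}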

However, a compiler would be very loud about passing an unsigned char * (or an alias of it, such as uint8 *) to strlen, hence the cast.
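
Put together, the pattern from the question might look like this (again assuming the uint8 typedef; the helper is hypothetical):

#include <string.h>

typedef unsigned char uint8;   /* assumed definition of uint8 */

size_t u8_strlen(const uint8 *s)
{
    /* passing s directly would draw a pointer-type diagnostic;
       the cast to const char * is well defined and silences it */
    return strlen((const char *)s);
}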

  • Can you point to the documentation saying the "is*" functions need unsigned char? The man pages I have say "int". – coredump Jul 08 '18 at 08:00
  • @coredump yeah, added a link - they're unsigned chars converted to `int`, plus `EOF` - just the same values as `getchar()` would return – Antti Haapala -- Слава Україні Jul 08 '18 at 08:02
  • As long as you use `char` you are fine for characters. It has to be able to represent all characters of the target character set by definition. The example above had better have used `char` from the start - actually `const char`, to avoid any problems from accidentally trying to write to the string literal. Interestingly, the string functions interpret the passed `char` arguments (resp. the pointers' targets) as `unsigned char` where this would make a difference. – too honest for this site Jul 08 '18 at 11:02
  • @Olaf we know very little about the code - of course the 2-line example would better have used `char`, but it didn't need to involve a string literal, nor did it need to use `strlen` either, since the return value is not stored anywhere. – Antti Haapala -- Слава Україні Jul 08 '18 at 12:19
  • Well, we have to rely on what the OP posts being a [mcve] (emphasis on the **c**) or at least the relevant part of the code. – too honest for this site Jul 08 '18 at 12:40
  • My point wasn't so much about the string **literal** (that was only about using `const`), but about an array/pointer for a string in general. The string functions don't make much sense if you don't have some kind of C string. (And the `memXXX` functions take a `void *`.) – too honest for this site Jul 08 '18 at 12:48
  • It is indeed quite unfortunate that so many systems use signed `char` by default. It makes the C library semantics inconsistent. `isalpha` and friends, `tolower`, `toupper`, but also `strcmp`, `getchar`... all use unsigned char values, a very often overlooked fact even by expert C programmers. Kudos for pointing this out. – chqrlie Jul 08 '18 at 13:08
  • And signed char makes the `char ch = getchar(); if (ch == EOF) {...}` "work" :( – Antti Haapala -- Слава Україні Jul 08 '18 at 13:59
  • @chqrlie: "So many". Most systems seem to be using unsign `char`. But yeah, there is just one `signed char` platform requiredd to mess things up. Not the first, but maybe the last legacy in the C language. It would be really good if the next version of the standard would simply make a clear cut, even at the cost ofd compatibility. Call me a dreamer … – too honest for this site Jul 08 '18 at 18:25
  • @AnttiHaapala It doesn't work (as you know), but the likelihood of a (typically) `0xFF` character code appearing in a string is just low (and it indeed works for people ignoring languages other than English, like the origin of ASCII). – too honest for this site Jul 08 '18 at 18:28
  • @AnttiHaapala: and the fans of French author *Pierre Louÿs*. Note however that `0xFF` is an invalid byte in UTF-8, which is becoming prevalent today... – chqrlie Jul 09 '18 at 06:33
  • @Olaf: *Most systems seem to be using unsigned `char`*? Not really. Most compilers allow the programmer to specify the default signedness of `char`, but the default on x86 is signed, which covers most PCs, laptops and servers. Granted there are even more phones and embedded systems that may have unsigned by default, that is still a good proportion of C code in use today. – chqrlie Jul 09 '18 at 06:37
  • @chqrlie it should probably read "most device units" as in number of processors... ARM <3 – Antti Haapala -- Слава Україні Jul 09 '18 at 06:39
  • @Olaf: *Call me a dreamer*: I have a vision for this evolution. We could define a subset of the C language where such implementation details are more constrained and many cases of undefined behavior are defined (unsigned char by default, two's complement representation for integers, with defined overflow and signed right-shift semantics, IEEE representations for floating point...), some risky library functions removed (`gets`, `strncpy`, `strtok`...), `EOF == -1`, `EXIT_SUCCESS == 0`, a much higher diagnostics level with errors instead of warnings... We should call this **strict C**. – chqrlie Jul 09 '18 at 06:59
  • I agree with everything else except defined signed integer overflow. Whether `INT_MAX + 1` evaluates to `INT_MAX + 1` or to `INT_MIN`, it is a bug already :D – Antti Haapala -- Слава Україні Jul 09 '18 at 07:05
  • @chqrlie I'd much rather have a language that aborts on signed integer overflow :D – Antti Haapala -- Слава Україні Jul 09 '18 at 07:14
  • @AnttiHaapala: a program that has integer overflows may indeed be buggy, but defining the behavior one way or another removes the opportunity for the compiler to perform counter-intuitive optimisations. Trapping on overflow may not be feasible at minimal cost on many architectures, wrap around is the de facto standard behavior on two's complement architectures. – chqrlie Jul 09 '18 at 08:12
  • @chqrlie The counter-intuitive optimizations come from the fact that people assume that if something is possible in the assembly on an architecture, then the C compiler must implement it that way. There is nothing counter-intuitive if people don't assume this, or if they don't know assembly. Cf. aligned/unaligned access. [BTW, for further discussion we do have the C chat room](https://chat.stackoverflow.com/rooms/54304/c)... – Antti Haapala -- Слава Україні Jul 09 '18 at 08:21
  • @chqrlie, Antti: To complete the nastiness: We concentrated on `0xFF`, but actually `EOF` is not required to be `-1`, just negative. Which will make all non-ASCII characters problematic (or >7 bit non-ASCII-based encodings). – too honest for this site Jul 09 '18 at 11:32
  • @chqrlie: By "most systems" I mean the standard ABI. Compilers are flexible, but most platforms require a certain behaviour for interoperability e.g. with the standard library and others. ARM ABIs for instance do simply becahuse early ARMs lacked some sign-extend loads IIRC, so they used `unsigned char` from the early days on. And that's one of the most used platforms wich decades more devices than e.g. x86. To complete that I just had to check 8051 which I assume also uses unsigned `char`s. Those two are the vast majority of systems already. – too honest for this site Jul 09 '18 at 11:36
  • @chqrlie: As an embedded developer, while I'm not happy about UB, I don't see how this can be avoided without major degradation in code performance and increased size on common CPUs which don't support e.g. hardware address-range verification for each access. Btw: `gets` **has not been** part of the standard for 7 years now. – too honest for this site Jul 09 '18 at 11:41
  • @Olaf: Of course `gets` has been removed from the standard, but C compilers still routinely accept it and C libraries keep it for compatibility reasons. If compiling in *strict C* mode, this function would be forbidden. I am not advocating a *safe* version of the language with costly checks at runtime, just a few semantic changes that make certain undefined expressions defined, such as `-1 >> 1`. – chqrlie Jul 09 '18 at 14:14
  • @chqrlie: This is clearly not related to the compiler at all, but only to the library. That said, the MS libc even warns about safe functions just to promote their non-standard stuff. This does not really help beginners trust the standard. That said, on typical freestanding environments there is no stdlib, maybe except for functions required by the compiler to implement the normal operators (`memcpy` etc. - gcc is a good example here). Your example is a particularly bad one to support your position: many smaller CPUs don't have a signed shift-right instruction. – too honest for this site Jul 09 '18 at 15:13