
I'm trying to create my own versions of C functions, and when I got to memcpy and memset I assumed that I should cast the destination and source pointers to char *. However, I've seen many examples where the pointers were cast to unsigned char * instead. Why is that?

void *mem_cpy(void *dest, const void *src, size_t n) {

    if (dest == NULL || src == NULL)
        return NULL;
    int i = 0;
    char *dest_arr = (char *)dest;
    char *src_arr = (char *)src;
    while (i < n) {
        dest_arr[i] = src_arr[i];
        i++;
    }
    return dest;
}
chqrlie
  • In that case it doesn't matter because you're only doing assignments. But generally it's better to use `unsigned char` if you're operating on raw bytes, and even more so if you're doing bitwise operations with them – Jabberwocky Feb 09 '23 at 15:01
  • For what you do, it doesn't really matter. Both are 1 byte, and you never really care what that byte represents. So you could use `char` as well. That said, when dealing with bytes without any further semantics, it is quite common to treat them as unsigned. For example, when you see a dump of file content in hexadecimal (like `hexdump` shows), we consider the bytes to be non-negative numbers between 0 and 255 (0xFF). So it is more instinctive to think of a "bunch of bytes" as "unsigned char", unless there is specific semantics that gives meaning to potentially negative values. – chrslg Feb 09 '23 at 15:07
  • Side-note: `n` is a `size_t`. `i` should be too. If you turned up the warnings on your compiler, it would likely warn you about the signed/unsigned mismatch. As written, this code will do *horrible* things when `n` exceeds `INT_MAX` (`2 ** 31 - 1` on most machines). Also, in modern C, it's rather simpler to replace the three separate lines `int i = 0;`, `while (i < n) {` and `i++` with just a `for` loop: `for (size_t i = 0; i < n; ++i) {`. The compiler should compile both the same (assuming no `break`s/`continue`s), but `for` loops are easier to read, putting all the info in one place. – ShadowRanger Feb 09 '23 at 15:10

3 Answers


It doesn't matter for this case, but a lot of folks working with raw bytes will prefer to explicitly specify unsigned char (or with stdint.h types, uint8_t) to avoid weirdness if they have to do math with the bytes. char has implementation-defined signedness, and that means, when the integer promotions & usual arithmetic conversions are applied, a char with the high bit set is treated as a negative number if signed, and a positive number if unsigned.

While neither behavior is necessarily wrong for a given problem, the fact that the behavior can change between compilers, or even with different flags set on the same compiler, means you often need to be explicit about signedness, using either signed char or unsigned char as appropriate. And 99% of the time, the behavior of unsigned char is what you want, so people tend to default to it even when it's not strictly required.
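
For illustration (this sketch is mine, not part of the original answer), here is how the same bit pattern can promote differently; the result for plain char depends on the implementation:

#include <stdio.h>

int main(void) {
    char c = (char)0x80;     /* high bit set; resulting value is implementation-defined if char is signed */
    unsigned char u = 0x80;

    /* Integer promotion: c typically becomes -128 on a signed-char platform, 128 if char is unsigned. */
    printf("char promotes to: %d\n", (int)c);
    /* u always promotes to 128. */
    printf("unsigned char promotes to: %d\n", (int)u);
    return 0;
}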

ShadowRanger

There's no particular reason in this specific case, it's mostly stylistic.

But in general it is always best to stick to unsigned arithmetic when dealing with raw data. That is: unsigned char or uint8_t.

The char type is problematic because it has implementation-defined signedness and is therefore avoided in such code. Is char signed or unsigned by default?
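
If you want to check which flavor you get on a given compiler, a quick sketch (my addition, not part of the linked question) is to look at CHAR_MIN from <limits.h>:

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* CHAR_MIN is 0 when plain char is unsigned and negative when it is signed. */
    printf("char is %s on this implementation\n", CHAR_MIN < 0 ? "signed" : "unsigned");
    return 0;
}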


NOTE: this is dangerous and poor style:

char *src_arr = (char *)src;

(And the cast swept the problem under the carpet.)

Since you correctly used "const correctness" for src, the correct type is const char *src_arr. I'd change the code to:

unsigned char *dest_arr = dest;
const unsigned char *src_arr = src;

A good rule of thumb for beginners is to never use a cast. I'm serious. Some 90% of all casts we see on SO in beginner-level programs are wrong, in one way or another.
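
Putting those pieces together, a possible sketch of the corrected function (my illustration, keeping the name mem_cpy from the question) could look like this; the NULL check from the question is left out on purpose, see the note on null pointers further down:

#include <stddef.h>

void *mem_cpy(void *dest, const void *src, size_t n) {
    unsigned char *dest_arr = dest;       /* no cast needed: void * converts implicitly */
    const unsigned char *src_arr = src;   /* const-correct source pointer, still no cast */

    for (size_t i = 0; i < n; i++)        /* size_t index matches the size_t count */
        dest_arr[i] = src_arr[i];

    return dest;
}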


Btw (advanced topic) there's a reason why memcpy has the prototype as:

void *memcpy(void * restrict s1,
      const void * restrict s2,
      size_t n);

The restrict qualifier on the pointers tells the user of the function "hey, I'm counting on you not to pass two pointers to the same object, or pointers that may overlap". Doing so would cause problems in various situations and for various targets, so this is a good idea.
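
To illustrate what overlap means in practice (my example, not from the answer), copying a buffer one byte forward is a classic case where a plain forward byte copy reads bytes it has already overwritten; memmove is the standard function specified to handle this:

#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[8] = "abcdefg";

    /* dest (buf + 1) overlaps src (buf): memcpy is not allowed here because of restrict,
       but memmove copies as if through a temporary buffer. */
    memmove(buf + 1, buf, 6);
    printf("%s\n", buf);   /* prints "aabcdef" */
    return 0;
}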

It's much more likely that the user passes overlapping pointers than null pointers, so if you are going to have slow, superfluous error checking against NULL, you should also restrict-qualify the pointers.

If the user passes on null pointers I'd just let the function crash, instead of slowing it down with extra branches that are pointless bloat in some 99% of all use cases.

Lundin
  • `char *src_arr = (char *)src;` [is not undefined behavior](https://stackoverflow.com/a/9079161/364696), unless *both* of the following are true: 1) The target of the pointer was in fact declared `const`, and 2) You attempt to mutate the target. While #1 may or may not be true for arbitrary `memcpy` calls, the code never tries to write to `*src_arr`, so it's not undefined behavior, just a bad idea. I agree that no casts are needed here (yay C letting `void*` become anything), and the OP should be `const` correct just to protect themselves from unintentional errors, but it's not undefined. – ShadowRanger Feb 09 '23 at 15:20
  • @ShadowRanger Yeah agreed, I'll edit the answer. – Lundin Feb 09 '23 at 15:25
  • @LearningCode As for why you don't require a cast here: *"A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.*" — C11 – Harith Feb 09 '23 at 15:36
  • *"If the user passes on null pointers I'd just let the function crash"* ---> so an `assert()` would be applicable here? – Harith Feb 09 '23 at 15:37
  • @Haris Not quite, 6.5.16.1 is the reason why: "the left operand has atomic, qualified, or unqualified pointer type, and (considering the type the left operand would have after lvalue conversion) one operand is a pointer to an object type, and the other is a pointer to a qualified or unqualified version of void, and the type pointed to by the left has all the qualifiers of the type pointed to by the right;" – Lundin Feb 09 '23 at 15:41
  • @Haris Yeah sure, an assert would be better. Or just document that the function takes no responsibility for cleaning up someone else's null pointer trash, since it's not memcpy's responsibility. "src and dst must be pointers to allocated arrays of at least size n" or some such. – Lundin Feb 09 '23 at 15:42
  • Your comments about casting are on point, but I do personally tend to make an explicit exception for casting for arithmetic purposes. On the other hand, I suppose it takes a certain amount of understanding recognize what that means, so maybe the true novice is better served by ignoring those cases. – John Bollinger Feb 09 '23 at 15:47
  • @JohnBollinger Looking at how many pitfalls C has with type compatibility, type qualifiers, alignment, strict aliasing, endianness, void pointers, null pointers, function pointers, special rules for casting structs etc etc, it's better for beginners to simply not go there at all IMO. Only cast in case your personal language-lawyer is present. You have the right to refrain from casting. Any cast you use can or will be used against you. :) – Lundin Feb 09 '23 at 15:58
  • @Lundin: Certain combinations are vastly more common than others, and it's unfortunate that the Committee refuses to recognize categories of implementations that do things in the more common ways, so a program that only needs to work with platforms that use common abstraction models #1 and #3 could start with `#if`/`#error` directives to test for those, and could then safely assume that any actions that would be defined under those abstraction models would work as defined thereby, even if they might invoke UB on other implementations. – supercat Apr 21 '23 at 17:20

Why ... unsigned char* instead of char*?

Short answer: because the functionality differs for certain operations when char is signed, and the C spec specifies unsigned char-like behavior for the str...() and mem...() functions.


When does it make a difference?

When a function (like memcmp(), strcmp(), etc.) compares for order and one byte is negative while the other is positive, the order of the two bytes differs. Example: -1 < 1, yet when viewed as an unsigned char: 255 > 1.
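
A short sketch of that ordering difference (my example, not chux's):

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char a = 0xFF, b = 0x01;

    /* memcmp is specified to compare bytes as unsigned char, so 0xFF (255) > 0x01 (1). */
    printf("memcmp says a > b: %d\n", memcmp(&a, &b, 1) > 0);   /* 1 */

    /* Compared as plain char, 0xFF is typically -1 on a signed-char platform, so it sorts below 1. */
    char sa = (char)0xFF, sb = 1;
    printf("char compare says sa > sb: %d\n", sa > sb);         /* 0 if char is signed */
    return 0;
}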

When does it not make a difference?

When copying data and comparing for equality*1.


Non-2's complement

*1 One's complement and sign-magnitude encoding are expected to be dropped in the upcoming version C2x. Until then, those signed encodings support two zeros. For the str...() and mem...() functions, C specifies data access as unsigned char. This means only the +0 is a null character, and ordering follows the pure binary, unsigned encoding.

chux - Reinstate Monica
  • Have there ever been any non-contrived implementations where `char` (as distinct from `signed char`) could not represent as many distinct values as `unsigned char`? – supercat Apr 21 '23 at 17:14
  • @supercat As a `char` has the same range as `signed char` or `unsigned char`, given your "distinct from `signed char`", `char` would have as many distinct values as `unsigned char`. – chux - Reinstate Monica Apr 21 '23 at 21:00
  • Perhaps I should reword my question: has there ever been any implementation in which `signed char` had fewer distinct values than `unsigned char`, *and* in which (unadorned) `char` was signed? – supercat Apr 21 '23 at 21:23
  • @supercat That case would potentially occur with non-2's complement integer encoding (e.g. 255 distinct values) when `char` is _signed_. My ancient experience with such machines used _unsigned_ `char`. Yet _signed_ was possible. C spec hints at that possibility with "For all functions in this subclause, each character shall be interpreted as if it had the type `unsigned char` (and therefore every possible object representation is valid and has a different value)." C17 § 7.24.1 3. Such concerns, with C23, hopefully will be moot with the expected 2's complement encoding requirement. – chux - Reinstate Monica Apr 21 '23 at 21:32
  • The Standard makes no attempt to anticipate and forbid everything the designer of a C implementation might do to undermine its usefulness. If the authors of the Standard thought it unlikely that anyone would ever produce an implementation that made unadorned `char` a ones'-complement signed type, even in the absence of a rule forbidding such a thing, they would have seen no reason to waste ink prohibiting it. While the parenthetical note for string.h functions may have been unnecessary, making their behavior independent of the signedness of unadorned `char` was essential to... – supercat Apr 21 '23 at 21:41
  • ...allow smooth interoperation between compilation units processed by implementations/configurations where `char` is signed and those where `char` is unsigned. Rather than having the Standard mandate that implementations behave as they already do, I'd rather it recognize categories of implementations that do or do not uphold common-but-not-universal behavioral guarantees, such as that initializing all of the bytes of a pointer object to zero will set its value to null. – supercat Apr 21 '23 at 21:45
  • Comments are venturing from the [question](https://stackoverflow.com/questions/75400447/assignment-create-my-own-memcpy-why-cast-the-destination-and-source-pointers-t/75404340?noredirect=1#comment134170023_75404340). Perhaps all that interesting info belongs in a different place than comments about an answer of a `memcpy()` question. – chux - Reinstate Monica Apr 21 '23 at 21:49