
I was implementing a version of `memcpy()` to be able to use it with `volatile`. Is it safe to use `char *`, or do I need `unsigned char *`?

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c  = (const volatile char *)src;
    volatile char *dest_c       = (volatile char *)dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i]   = src_c[i];
    }

    return  dest;
}

I think `unsigned` should be necessary to avoid overflow problems if the data in any byte of the buffer is greater than `INT8_MAX`, which I think might be UB.
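For concreteness, a sketch of the concern (assuming two's complement and a signed plain `char`): a byte holding 200 reads back negative through `char`, but the bit pattern round-trips intact:

#include <stdio.h>

int main(void)
{
    unsigned char byte = 200;                   /* > INT8_MAX */
    char c = *(char *)&byte;                    /* same bits, typically reads as -56 */
    unsigned char back = *(unsigned char *)&c;  /* recover the original bits */

    printf("through char: %d, round-tripped: %u\n", c, (unsigned)back);
    return 0;
}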

  • Why would you think it would be UB? Do both allow access to all of the bits regardless of what number those bits might represent based on a particular type? – Retired Ninja Mar 03 '19 at 00:00
  • Ok, so it is reinterpreting the data no matter what it was originally, right? The value is possibly undefined, but the overall result is that you return it intact, and don't use the value, so it is defined behaviour in the end, right? – alx - recommends codidact Mar 03 '19 at 00:03
  • Undefined behaviour doesn't get "un-undefined"; that makes no sense. But I'm also not seeing where you perceive the UB here. Instead of just "I think might be UB" please expand to present _clearly_ and _in detail_ your problem. – Lightness Races in Orbit Mar 03 '19 at 00:21
  • You are reading a value which may perfectly be `(uint8_t)200`, and in the moment you dereference it as a (possibly `signed`) `char *`, the value magically transforms into negative something. I have never had the need to do any type punning, so I do not know what is allowed and what is not, but that seems at least weird, to me. – alx - recommends codidact Mar 03 '19 at 00:32
  • @peanut: reading a bit pattern out of memory through a cast pointer does not involve a conversion. The value in memory is just a bunch of bits; it does not have any other information. So the only issue is whether all bit patterns are valid, and what happens to bit patterns which are not. As requested, I addressed that in an answer – rici Mar 03 '19 at 04:46
  • You might be able to prove that plain `char` is safe, but just using `unsigned char` will save you all that effort. (If there were a problem, it might show up with `-0` values/representations in an implementation where plain `char` is signed and doesn't use 2's-complement. I doubt that any such systems exist.) – Keith Thompson Mar 03 '19 at 05:13

4 Answers


In theory, your code might run on a machine which forbids one bit pattern in a signed char. It might use ones' complement or sign-magnitude representations of negative integers, in which one bit pattern would be interpreted as a 0 with a negative sign. Even on two's-complement architectures, the standard allows the implementation to restrict the range of negative integers so that INT_MIN == -INT_MAX, although I don't know of any actual machine which does that.
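One way to see which case an implementation falls into is to compare the `<limits.h>` bounds (a minimal probe, using only standard macros):

#include <limits.h>
#include <stdio.h>

int main(void)
{
#if SCHAR_MIN == -SCHAR_MAX
    /* Symmetric range: one bit pattern is left over, to be either
     * a negative zero or a trap representation. */
    puts("signed char range is symmetric");
#else
    /* Asymmetric range (e.g. -128..127): full two's complement;
     * every bit pattern is a normal value. */
    puts("signed char range is asymmetric");
#endif
    return 0;
}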

So, according to §6.2.6.2p2, there may be one signed character value which an implementation might treat as a trap representation:

Which of these [representations of negative integers] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two [sign-magnitude and two's complement]), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.

(There cannot be any other trap values for character types, because §6.2.6.2 requires that signed char not have any padding bits, which is the only other way that a trap representation can be formed. For the same reason, no bit pattern is a trap representation for unsigned char.)

So, if this hypothetical machine has a C implementation in which char is signed, then it is possible that copying an arbitrary byte through a char will involve copying a trap representation.

For signed integer types other than char (if it happens to be signed) and signed char, reading a value which is a trap representation is undefined behaviour. But §6.2.6.1/5 allows reading and writing these values for character types only:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that _does not have character type_, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that _does not have character type_, the behavior is undefined. Such a representation is called a trap representation. (Emphasis added)

(The third sentence is a bit clunky, but to simplify: storing a value into memory is a "side effect that modifies all of the object", so it's permitted as well.)

In short, thanks to that exception, you can use char in an implementation of memcpy without worrying about undefined behaviour.

However, the same is not true of strcpy. strcpy must check for the trailing NUL byte which terminates a string, which means it needs to compare the value it reads from memory with 0. And the comparison operators (indeed, all arithmetic operators) first perform integer promotion on their operands, which will convert the char to an int. Integer promotion of a trap representation is undefined behaviour, as far as I know, so on the hypothetical C implementation running on the hypothetical machine, you would need to use unsigned char in order to implement strcpy.
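A minimal sketch of that workaround (a hypothetical `strcpy_u`, not the library function): do the reads, the stores, and the comparison against the terminator through `unsigned char`, which has no trap representations:

char *strcpy_u(char *dest, const char *src)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    /* The value compared against 0 is an unsigned char promoted to int,
     * so no bit pattern is ever read as a (possibly trapping) signed char. */
    while ((*d++ = *s++) != 0)
        ;
    return dest;
}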

rici
  • On a platform using non-2's-complement (yes, they are relics these days), with `char a` holding negative zero and `char b`, after `b = a`, must `b` take on `-0`, or can the pesky `-0` become `0` as it is assigned to `b`? If the latter, then although the discussion about trap values is good and using `char` does avoid UB, it does not ensure `for (size_t i = 0; i < n; i++) { dest_c[i] = src_c[i]; }` performs as desired: copy a binary image from here to there. – chux - Reinstate Monica Mar 03 '19 at 16:22
  • This lengthy language-lawyering explanation should convince the reader to **not** use the `char` type to manipulate bytes. It takes a very savvy C expert to explain why it is safe to use `char` where `unsigned char` is an obvious solution for plain programmers. Keep code as simple as possible; use `unsigned char`. – chqrlie Mar 03 '19 at 19:54
  • @chux: Has there *ever* been a conforming C99 or C11 implementation that did not use two's-complement representation for `char`? – supercat Mar 04 '19 at 22:39
  • @supercat Try [gcc -funsigned-char](https://stackoverflow.com/a/46463173/2410359) and you will be compiling C99/C11 with a `char` that does not use two's-complement representation. See also [Are there any non-twos-complement implementations of C?](https://stackoverflow.com/questions/12276957/are-there-any-non-twos-complement-implementations-of-c) – chux - Reinstate Monica Mar 04 '19 at 23:37
  • @chux: I meant `signed char`. Answers to that question didn't mention any implementations of *C99 or C11*. – supercat Mar 05 '19 at 04:52
  • @supercat, as far as I know, no. The often-mentioned Unisys ClearPath OS 2200 has a C compiler targeted at C90, and there doesn't seem to be any plan to support C99 or any more recent version. (Although the C90 compiler continues to be maintained.) There are ongoing discussions to remove representations other than two's complement from C2x (and C++2x), and no other examples have surfaced in committee discussions, according to the proposal authors. – rici Mar 05 '19 at 04:57
  • @rici: The Standard should recognize a category of two's-complement implementations where bytes are assembled into longer types in big-endian fashion without padding bits, a category of two's-complement implementations where they are assembled in little-endian fashion without padding bits, and an "anything goes" category which imposes no particular requirements on implementations. I don't see any reason to mention ones'-complement or sign-magnitude choices, but nor do I see any need to explicitly limit things to two's-complement. – supercat Mar 05 '19 at 05:08

Is it safe to use char * or do I need unsigned char *?

Perhaps


"String handling" functions such as memcpy() have the specification:

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value). C11dr §7.23.1 3

Using `unsigned char` is the specified "as if" type. There is little to be gained by attempting others, which may or may not work.


Using `char` with `memcpy()` may work, but extending that paradigm to other, similar functions leads to problems.

A single big reason to avoid `char` for the `str...()`- and `mem...()`-like functions is that it sometimes makes a functional difference, unexpectedly.

`memcmp()` and `strcmp()` certainly differ between (signed) `char` and `unsigned char`.
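For example, the same pair of byte values can order differently under the two types (a small sketch; assumes two's complement and a signed plain `char`):

#include <stdio.h>

int main(void)
{
    char          a = (char)0x80, b = 0x7F;   /* -128 vs. 127 when char is signed */
    unsigned char ua = 0x80,      ub = 0x7F;  /* 128 vs. 127 */

    printf("plain char:    a %s b\n", a < b ? "<" : ">=");     /* prints "<"  */
    printf("unsigned char: ua %s ub\n", ua < ub ? "<" : ">="); /* prints ">=" */
    return 0;
}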

Pedantic: On a relic non-2's-complement platform with a signed `char`, only `'\0'` should end a string. Yet negative zero compares equal to `0` too, even though a `char` holding negative zero should not indicate the end of a string.

chux - Reinstate Monica

You do not need unsigned.

Like so:

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c  = (const volatile char *)src;
    volatile char *dest_c       = (volatile char *)dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i]   = src_c[i];
    }

    return  dest;
}

Attempting to make a conforming implementation where `char` has a trap value will eventually lead to a contradiction:

  • `fopen("", "rb")` does not require use of only `fread()` and `fwrite()`.
  • `fgets()` takes a `char *` as its first argument and can be used on binary files.
  • `strlen()` finds the distance to the next null byte from a given `char *`. Since `fgets()` is guaranteed to have written one, it will not read past the end of the array and therefore will not trap (see the sketch below).
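A minimal sketch of that chain of reasoning (the file name is hypothetical):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[256];
    FILE *f = fopen("data.bin", "rb");   /* hypothetical binary file */

    /* fgets() stores arbitrary bytes into a char array and null-terminates
     * it; strlen() must then be able to walk those bytes as char. */
    if (f != NULL && fgets(buf, sizeof buf, f) != NULL)
        printf("%zu bytes before the first null\n", strlen(buf));
    if (f != NULL)
        fclose(f);
    return 0;
}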
Joshua
  • What if `char` has a trap representation? – Eric Postpischil Mar 03 '19 at 01:04
  • @EricPostpischil: In memory? It's not allowed to. I could dig out my K&R book but it's going to tell me no way. Only uninitialized locals and pointer types do that. – Joshua Mar 03 '19 at 01:11
  • Yes, [`char` may have a trap representation](https://stackoverflow.com/questions/48570155/does-accessing-an-int-with-a-char-potentially-have-undefined-behavior). K&R is not relevant; we use ISO C now. The bit about uninitialized locals having potentially undefined behavior is a special rule not related to trap representations. – Eric Postpischil Mar 03 '19 at 01:22
  • The reasoning that because `fread`, `fwrite`, and `fgets` can be used with binary files, they must support the reading and writing of arbitrary data with `char` is incorrect. Per C 2018 7.21.2 3, “A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation.” C’s streams are only required to support writing the representations of data and reading it back into the same type with results that compare equal in the type. – Eric Postpischil Mar 03 '19 at 17:35
  • @EricPostpischil: Have there ever been any non-contrived C99 or C17 implementations where CHAR_MAX-CHAR_MIN is an even number? Are there likely to be? If not, why should anyone care whether ISO C would allow such a thing? – supercat Mar 04 '19 at 22:50
  • @supercat: I do not maintain a list of C implementations. A good way to manage information about how to deal with portability issues like this is to have a team of skilled and knowledgeable people survey the implementations and practices and craft a document with specifications and recommendations. Actually, that sort of work should have been done decades ago, and I would recommend using it as a guiding document. The C tag tells us to use ISO 9899 as a base, and the question asks what is safe, not what is kind of likely to work. – Eric Postpischil Mar 08 '19 at 14:13
  • Additionally, asking about past implementations misses most of them. The fact that the C standard specified things (including boundaries between various kinds of defined and undefined behaviors) enabled people to craft new C implementations. Optimizers would be hampered if they were not able to rely on certain behavior being left undefined by the standard. So, when asking whether we can neglect parts of the C standard, we must not just ask what may happen in past C implementations but what may happen in future C implementations, to which the answer is we do not really know. – Eric Postpischil Mar 08 '19 at 14:16
  • @EricPostpischil: The authors of the Standard made no effort to identify all of the actions that should be available to people writing code for any particular kinds of tasks on any particular kinds of platforms. Instead, they wanted support for a variety of "popular extensions" to be determined by a marketplace driven by "quality of implementation". A good quality implementation intended to be suitable for a particular task will "not prevent the programmer from doing what needs to be done" to accomplish that task. A low-quality implementation could be "conforming" without being able... – supercat Mar 08 '19 at 16:17
  • ...to run any useful programs at all [the published Rationale actually acknowledges this possibility] but the Committee expected that quality implementations will strive to be useful whether or not the Standard requires them to do so. If the intention of making things UB was to make them unusable, and if the authors of the Standard wanted to avoid needlessly breaking code, the Rationale should have mentioned some reason for reclassifying the evaluation of e.g. `-1<<1` from fully-defined behavior on implementations without padding bits to Undefined Behavior. If, however, the authors... – supercat Mar 08 '19 at 16:21
  • ...of the Standard intended that implementations behave in C89 fashion *except when there's a good reason to do something else*, then the reclassification would need no particular explanation. Code that relies upon signed left shift behaving in some fashion is likely to work usefully only on implementations that would have no reason for processing it in some other fashion, so from a practical perspective changing the behavior from defined to UB would only affect a program's behavior on platforms where it wouldn't have worked anyway. – supercat Mar 08 '19 at 16:30
  • @supercat: You have just written an excessive amount of text on how C implementations ought to behave which is not relevant to the question at hand about what C programmers ought to guard against. Per the C tag description, I will answer and inform people about conformance with the C standard, particularly when they ask what is safe, rather than what is practical for some specific situation. It is fine to write code with requirements in addition to the C standard, provided the requirements are documented, but that is different from answering questions about the standard. – Eric Postpischil Mar 08 '19 at 17:34
  • @EricPostpischil: The Standard's definition of "conforming C implementation" is so loose that merely knowing something is a "conforming C implementation"--without knowing at least something about its quality--would be totally useless, and its definition of "conforming C program" is even looser. The Standard's definition of "strictly conforming program" is a bit more useful, but only relevant if one is trying to accomplish tasks that are within the abilities of strictly conforming programs. – supercat Mar 08 '19 at 20:36

The `unsigned` is not needed, but there is no reason to use plain `char` for this function. Plain `char` should only be used for actual character strings. For other uses, the types `unsigned char`, or `uint8_t` and `int8_t`, are more precise, as the signedness is explicitly specified.

If you want to simplify the function code, you can remove the casts:

volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n) {
    const volatile unsigned char *src_c = src;
    volatile unsigned char *dest_c = dest;

    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}
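A hypothetical usage sketch (the buffer and function names are made up for illustration); with either version, every byte access goes through a volatile lvalue, so the compiler cannot elide or reorder the accesses:

/* Assumes the memcpy_v() defined above is in scope. */
volatile unsigned char rx_buf[64];   /* hypothetical: filled by an ISR or DMA */

void drain(void)
{
    unsigned char local[sizeof rx_buf];

    memcpy_v(local, rx_buf, sizeof local);   /* volatile-qualified byte copies */
    /* ... process the stable snapshot in local ... */
}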
chqrlie
  • Agree for `char` and `unsigned char`. Is `uint8_t` as valid as `unsigned char` in this case? In any other case where no type-punning occurs, I always use `uint8_t` for bytes, but I don't know in this case. – alx - recommends codidact Mar 03 '19 at 19:55
  • `uint8_t` is not necessarily available: it is specified as having exactly 8 value bits and two's complement representation. On systems that have it, it must be the same type as `unsigned char` because `unsigned char` must have a size of 1 byte, which cannot be smaller than 8 bits. (One could argue that a purposely perverse system where `char` would have ones' complement representation could also have a separate `uint8_t` type with two's complement representation, but I leave this discussion to DS9K implementers). – chqrlie Mar 03 '19 at 20:00
  • @AnttiHaapala: you are correct, I was referring to `int8_t` regarding the representation of negative numbers, which must be two's complement. – chqrlie Mar 09 '19 at 17:40