
Consider:

    constexpr char s1[] = "a";
    constexpr char s2[] = "abc";
    std::memcmp(s1, s2, 3);

If memcmp stops at the first difference it sees, it will not read past the second byte of s1 (the nul terminator). However, I don't see anything in the C standard that confirms this behavior, and I don't know of anything in C++ that extends it.

N1570 7.24.4.1:

    int memcmp(const void *s1, const void *s2, size_t n);

> The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.

Is my understanding correct that the standard describes the behavior as reading all n bytes of both arguments, and that libraries may short-circuit only under the as-if rule?

Peter Cordes
Ryan Haining
  • [C++ at least](http://en.cppreference.com/w/c/string/byte/memcmp) seems to say that the behavior is undefined if you access beyond either object, which means passing a size of 3 in your example leads to UB. – bnaecker Apr 04 '18 at 23:27
  • @bnaecker cppreference is fallible; the question is whether it *can* access beyond `s1` in this example. – Ryan Haining Apr 04 '18 at 23:29
  • I could foresee an implementation that uses SIMD instructions, which would be likely to compare past the first difference. I'd reckon the standard would avoid ruling out such optimizations. – kmdreko Apr 04 '18 at 23:30
  • Sure, it's not the standard, I'm just using it as one piece of evidence. – bnaecker Apr 04 '18 at 23:30
  • Nothing says the function must do a character by character comparison. So no guaranteed short-circuiting, I would say. –  Apr 04 '18 at 23:30
  • This is why you want to use `strncmp` instead of `memcmp`. – Sam Varshavchik Apr 04 '18 at 23:38
  • If you want "short-circuiting" behaviour, use `strcmp()`. There is nothing in the specification of `memcmp()` which requires the bytes to be compared sequentially, or which specifies circumstances in which any bytes in either sequence will not be accessed. Since `sizeof(s1) < 3`, your `memcmp()` call has undefined behaviour. – Peter Apr 04 '18 at 23:38
  • Re: strcmp/strncmp. I can rewrite this example to use arrays of ints and it's the same question. – Ryan Haining Apr 04 '18 at 23:40
  • @vu1p3n0x that's what I was thinking. – Ryan Haining Apr 04 '18 at 23:41

1 Answer


The function is not guaranteed to short-circuit because the standard doesn't say it must.

Not only is it not guaranteed to short-circuit, but in practice many implementations will not. For example, glibc compares elements of type unsigned long int (except for the last few bytes), so on a 64-bit implementation it can read up to 7 bytes past the first byte that differs.
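To illustrate why a word-at-a-time comparison touches bytes beyond the first difference, here is a minimal sketch. It is not glibc's code: `wordwise_compare`, `CompareResult`, and the fixed 8-byte word are hypothetical names and choices made for illustration, and the loads go through `std::memcpy` so that the sketch itself stays well defined for any alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only; this is NOT glibc's implementation.
struct CompareResult {
    int order;                 // <0, 0 or >0, with the same meaning as memcmp
    std::size_t bytes_loaded;  // bytes read from EACH buffer before returning
};

inline CompareResult wordwise_compare(const void* lhs, const void* rhs,
                                      std::size_t n) {
    const unsigned char* a = static_cast<const unsigned char*>(lhs);
    const unsigned char* b = static_cast<const unsigned char*>(rhs);
    const std::size_t word = sizeof(std::uint64_t);  // assumed word size: 8
    std::size_t i = 0;

    // Main loop: load and compare one whole word from each buffer.
    for (; i + word <= n; i += word) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, word);  // all 8 bytes are read here...
        std::memcpy(&wb, b + i, word);
        if (wa != wb) {
            // ...and only afterwards is the first differing byte located.
            for (std::size_t j = i; j < i + word; ++j) {
                if (a[j] != b[j]) {
                    return {a[j] < b[j] ? -1 : 1, i + word};
                }
            }
        }
    }

    // Tail: the remaining n % 8 bytes are compared one at a time.
    for (; i < n; ++i) {
        if (a[i] != b[i]) {
            return {a[i] < b[i] ? -1 : 1, i + 1};
        }
    }
    return {0, n};
}
```

With two 16-byte buffers that first differ at index 1, this sketch has already loaded 8 bytes from each buffer by the time it returns, whereas a byte-by-byte loop would have stopped after 2.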

Some may think that this won't cause an access violation on the platforms glibc targets, because accesses to these unsigned long ints will always be aligned and therefore will not cross a page boundary. But when the two sources have different alignments, glibc will read two consecutive unsigned long ints from one of the sources, and these may be in different pages. If the differing byte is in the first of those, an access violation can still be triggered before glibc performs the comparison (see function memcmp_not_common_alignment).

In short: Specifying a length that is larger than the real size of the buffer is undefined behavior even if the differing byte occurs before this length, and it can cause crashes on common implementations.
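One way to stay within bounds is to pass `memcmp` only the length that both buffers actually have, and to decide the result by size when that common prefix is equal. The following is a minimal sketch under that assumption; `bounded_compare` is a hypothetical helper name, not a standard library function.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: lexicographic comparison of two byte buffers that
// never asks memcmp to look at more bytes than either buffer contains.
inline int bounded_compare(const void* a, std::size_t a_size,
                           const void* b, std::size_t b_size) {
    const std::size_t common = std::min(a_size, b_size);
    if (common != 0) {
        const int r = std::memcmp(a, b, common);  // both buffers hold `common` bytes
        if (r != 0) {
            return r;
        }
    }
    // The common prefix is equal, so the shorter buffer orders first.
    if (a_size == b_size) {
        return 0;
    }
    return a_size < b_size ? -1 : 1;
}
```

For the arrays in the question, `bounded_compare(s1, sizeof s1, s2, sizeof s2)` asks `memcmp` to compare only 2 bytes, which is the full size of `s1`, so the call is well defined regardless of how `memcmp` is implemented.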

Here's proof that it can crash: https://ideone.com/8jTREr

Konrad Rudolph
interjay
  • The glibc example isn't evidence; it uses the same technique for `strcmp`, which *is* guaranteed by the standard not to read past a null terminator. It relies on platform-specific knowledge that such a read won't cause problems. – M.M Apr 05 '18 at 00:15
  • @M.M [glibc's strcmp implementation](https://github.com/lattera/glibc/blob/master/string/strcmp.c) doesn't contain such an optimization. – interjay Apr 05 '18 at 00:24
  • Maybe it's been changed; I recall reading discussion about it in the past. – M.M Apr 05 '18 at 00:25
  • @M.M It looks like it will use the compiler intrinsic for `strcmp` if one is available, but I assume that intrinsic will be implemented safely by the compiler. I have edited to add an analysis of the code detailing how the `memcmp` implementation can cause a crash if the size is too short. – interjay Apr 05 '18 at 00:44
  • Are you able to construct an example that crashes glibc's `memcmp` but wouldn't crash if it read one character at a time? – zneak Apr 05 '18 at 00:47
  • @zneak Added an example. – interjay Apr 05 '18 at 01:12