8

I have compared two string literals using the memcmp function.

#include <stdio.h>
#include <string.h>

int main() 
{
  char str1[] = "abcd";
  char str2[] = "ab";

  if (memcmp(str1, str2, 4) == 0) 
  {
    printf("equal string\n");
  }
  return 0;
}

In the above program, str2 is shorter than the str1. It means string str2 is accessed out of bounds.

So, is this undefined behavior?

Boann
  • 48,794
  • 16
  • 117
  • 146
msc
  • 33,420
  • 29
  • 119
  • 214

2 Answers2

11

The behaviour of your code is undefined. The C standard does not require that memcmp returns as soon as the result is known; that is it doesn't necessarily have to return when \0 is compared with 'c' despite the value of 'c' == '\0' being 0 for any character encoding supported by the language. The standard also doesn't specify the order in which the lexicographical comparisons are to be made (although it would be tricky for an implementation not to start from the beginning).

str2 is a char[3] type. It's possible that an attempt to access the 4th element is made.

Reference: http://en.cppreference.com/w/c/string/byte/memcmp

Bathsheba
  • 231,907
  • 34
  • 361
  • 483
  • 1
    While true, a standard library implementer that continues processing after the result is known, is a bad implementer... and they should feel bad. – StoryTeller - Unslander Monica May 02 '18 at 07:39
  • 3
    When asking about undefined behavior, expect the worst possible library, compiler and machine architecture and, of course, the most idiotic implementations :) –  May 02 '18 at 07:46
  • 2
    @StoryTeller: I'm wondering though with increasing use of parallelisation whether or not future implementations could make gains by reading ahead. – Bathsheba May 02 '18 at 07:47
  • @Bathsheba - Mmm.. haven't thought about that. My mind is often stuck in a single flow of control mentality. But still, not to return asap and continue processing intentionally... should still make our library implementer feel bad. – StoryTeller - Unslander Monica May 02 '18 at 07:51
  • I guess the only way to be save is to check the length of the string in advance before doing the memcmp() – Mike May 02 '18 at 07:52
  • @Mike Or, of course, use `strcmp()`? :) Searching for the `NUL` byte, so you can use a function that doesn't check for `NUL` bytes, seems *kind of* silly. –  May 02 '18 at 07:53
  • 1
    @FelixPalmen - Radical idea, that is ;) – StoryTeller - Unslander Monica May 02 '18 at 07:54
  • 4
    @StoryTeller The library is not required to access the memory via char pointers. It is more likely that the library is designed to access the memory with 34 or 64 bit width as long as the rest sequence is wide enough and aligned correctly, and only the wrong aligned first bytes, and the last bytes not fitting into this size are compared on single byte granularity. It is very likely that in this example the memory is compared with one single 32bit comparison. – Rudi May 02 '18 at 08:08
  • @Rudi - You are of course correct, but I didn't assume single byte granularity. Even at 64 bit access, assuming the two objects are large enough, continuing redundant access after the result is obtained (in however many instructions) is still silly. – StoryTeller - Unslander Monica May 02 '18 at 08:12
  • @StoryTeller Assuming a 32bit system and that both strings are aligned at a 32bit border on the stack, the out of bound access happens on the first comparison, since ``str2`` does not fit entirely into the first 32bits. – Rudi May 02 '18 at 08:22
  • @Rudi - For those two strings, sure. I was commenting in broader strokes. – StoryTeller - Unslander Monica May 02 '18 at 08:23
  • Wow, this is insane! – kamikaze May 03 '18 at 06:00
-2

Yes, behavior of your code is undefined. However, as long as you are using if (memcmp(str1, str2, 3) == 0) (note that byte count is 3 instead of 4. I.e. minimum of two), behavior of your code would be acceptable and correct.

The behavior is undefined if access occurs beyond the end of either object pointed to by lhs and rhs. The behavior is undefined if either lhs or rhs is a null pointer.

In case of strcmp, it stops as soon as if finds \0. However, for memcmp,

it is flawed assumption, that memcmp compares byte-by-byte and does not look at bytes beyond the first point of difference. The memcmp function makes no such guarantee. It is permitted to read all the bytes from both buffers before reporting the result of the comparison.

So, I would write my code like this:

#include <stdio.h>
#include <string.h>

#define MIN(X, Y) (((X) < (Y)) ? (X) : (Y))

int main() 
{
  char str1[] = "abcd";
  char str2[] = "ab";
  int charsToCompare = MIN(strlen(str1), strlen(str2)) + 1;

  if (memcmp(str1, str2, charsToCompare) == 0) 
  {
    printf("equal string\n");
  }
  return 0;
}

More details and analysis on memcmp can be found here.

Sandeep
  • 333
  • 2
  • 7
  • 1.) the size of `str2` is `3`, not `2` 2.) please add a reference for your citation. –  May 02 '18 at 07:42
  • @FelixPalmen 2 represents number of bytes you want to compare. It is not length of string. – Sandeep May 02 '18 at 07:51
  • And `str2` holds three(!) bytes. So your choice of `2` is kind of random. Still, my second remark was the more important one -- where's that citation from? –  May 02 '18 at 07:53
  • @FelixPalmen I see what you mean. Corrected my answer. Thanks. – Sandeep May 02 '18 at 07:58
  • 1
    Thanks, but what about **the source of your citation**? I didn't find this in the C standard description of `memcmp()`, so I'm **really** curious where it comes from. –  May 02 '18 at 07:59
  • 1
    That `MIN` macro is gonna end up calling `strlen` two times for one of the two strings, for no reason – StoryTeller - Unslander Monica May 02 '18 at 09:19
  • IMHO, the C Standard could benefit from having a few more variations on memcmp, including ones which return a pointer to the first difference, or simply report whether two blocks are equal, and possibly allowing code to indicate whether to optimize for blocks whose early portion matches, or those whose early portion does not match (if blocks are expected to be equal, and code doesn't care which block is "higher", using `diffs |= (*p1++ ^ *p2++)` and testing `diffs` at the end may be faster than branching after each comparison). – supercat May 02 '18 at 15:42