
Common sense and the literature are both clear about the behaviour of strcmp():

int strcmp( const char *lhs, const char *rhs );

Negative value if lhs appears before rhs in lexicographical order.

Zero if lhs and rhs compare equal.

Positive value if lhs appears after rhs in lexicographical order.

I can't seem to make it return any values other than -1, 0 and 1.

Granted, that behaviour is consistent with the definition, but I was expecting values greater than 1 or less than -1, since the definition says the results will be <0, 0 or >0, not -1, 0 or 1.

I tested this in several compilers and libraries with the same results. I would like to see an example where that's not the case.

Sample code:

#include <stdio.h> 
#include <string.h> 

  
int main(void)
{
   printf("%d ", strcmp("a", "a"));
   printf("%d ", strcmp("abc", "aaioioa"));
   printf("%d ", strcmp("eer", "tsdf"));
   printf("%d ", strcmp("cdac", "cdac"));
   printf("%d ", strcmp("zsdvfgh", "ertgthhgj"));
   printf("%d ", strcmp("abcdfg", "rthyuk"));
   printf("%d ", strcmp("ze34", "ze34"));
   printf("%d ", strcmp("er45\n", "io\nioa"));
   printf("%d", strcmp("jhgjgh", "cdgffd"));
}

Result: 0 1 -1 0 1 -1 0 -1 1

anastaciu
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/206114/discussion-on-question-by-anastaciu-strcmp-only-returns-1-0-and-1-no-matter-t). – Samuel Liew Jan 17 '20 at 00:50

4 Answers


The specification says that the numbers have to be negative, zero or positive, but it doesn't lock down the exact value necessary. The library itself may behave in more specific ways.

The spec means that code like this is not portable:

if (strcmp(a, b) == 1)

This may "work on my machine" but fail for someone else who uses a different library.

What you should write instead is:

if (strcmp(a, b) > 0)

That's all it really means: expect values other than just 1/-1 and code accordingly.

tadman

The C standard clearly says (C11 §7.24.4.2 The strcmp function):

The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.

It doesn't say how much greater than or less than zero the result must be; a function that always returns -1, 0 or +1 meets the standard; so does a function that sometimes returns values with a magnitude larger than 1, such as -27, 0, +35. If your code is to conform to the C standard, it must not assume either set of results; it may only assume that the sign of the result is correct.

Here is an implementation of strcmp() — named str_cmp() here so that the result can be compared with strcmp() — which does not return -1 or +1:

#include <string.h>
#include <stdio.h>

static int str_cmp(const char *s1, const char *s2)
{
    while (*s1 == *s2 && *s1 != '\0')
        s1++, s2++;
    int c1 = (int)(unsigned char)*s1;
    int c2 = (int)(unsigned char)*s2;
    return (c1 - c2);
}

int main(void) 
{  
   printf("%d ", strcmp("a", "a"));
   printf("%d ", strcmp("abc", "aAioioa"));
   printf("%d\n", strcmp("eer", "tsdf"));

   printf("%d ", str_cmp("a", "a"));
   printf("%d ", str_cmp("abc", "aAioioa"));
   printf("%d\n", str_cmp("eer", "tsdf"));
   return 0;
}

When run on a Mac (macOS Mojave 10.14.6; GCC 9.2.0; Xcode 11.13.1), I get the output:

0 1 -1
0 33 -15

I did change your data slightly — "aaioioa" became "aAioioa". The overall result is no different (but the value 33 is bigger than you'd get with the original string) — the return value is less than, equal to, or greater than zero as required.

The str_cmp() function is a legitimate implementation and is loosely based on a historically common implementation of strcmp(). It has slightly more care in the return value, but you can find two minor variants of it on p106 of Brian W Kernighan and Dennis M Ritchie The C Programming Language, 2nd Edn (1988) — one using array indexing, the other using pointers:

int strcmp(char *s, char *t)
{
    int i;
    for (i = 0; s[i] == t[i]; i++)
        if (s[i] == '\0')
            return 0;
    return s[i] - t[i];
}

int strcmp(char *s, char *t)
{
    for ( ; *s == *t; s++, t++)
        if (*s == '\0')
            return 0;
    return *s - *t;
}

The K&R code might not return the expected result if the plain char type is signed and one of the strings contains 'accented characters', that is, characters in the range -128 .. -1 (or 0x80 .. 0xFF when viewed as unsigned values). My str_cmp() code avoids this by casting the data to unsigned char; the (int) cast isn't strictly necessary because the assignments to int perform that conversion anyway. Subtracting two unsigned char values converted to int produces a result in the range -255 .. +255. However, C libraries that return only -1, 0 or +1 evidently don't use direct subtraction like that.

Note that the C11 standard §7.24.4 String comparison functions says:

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

You can look at the question "How do I check if a value matches a string?". The outline there shows:

if (strcmp(first, second) == 0)    // first equal to second
if (strcmp(first, second) <= 0)    // first less than or equal to second
if (strcmp(first, second) <  0)    // first less than second
if (strcmp(first, second) >= 0)    // first greater than or equal to second
if (strcmp(first, second) >  0)    // first greater than second
if (strcmp(first, second) != 0)    // first unequal to second

Note how comparing to zero uses the same comparison operator as the test you're making.

You could (but probably shouldn't) write:

if (strcmp(first, second) <= -1)    // first less than second
if (strcmp(first, second) >= +1)    // first greater than second

You'd still get the same results, but it is not sensible to do so; always comparing with zero is easier and more uniform.

You can get a -1, 0, +1 result using:

unsigned char c1 = *s1;
unsigned char c2 = *s2;
return (c1 > c2) - (c1 < c2);

For unrestricted integers (rather than integers restricted to 0 .. 255), this is safe because it avoids integer overflow, whereas subtraction can overflow and give the wrong result. For the restricted integers involved with 8-bit characters, overflow on subtraction is not an issue.

Jonathan Leffler
  • Returning 0 or ±1 is a relatively recent change in behaviour. If you look at the two questions mentioned in the comments, you'll see that (a) they're from 2012 and 2014, and (b) they quote source code from Apple and FreeBSD and GNU from that era, and sometimes the code returns the difference between the characters and sometimes it returns ±1 when they're different. Things appear to have changed since then. – Jonathan Leffler Jan 17 '20 at 08:53

Please re-read this bit:

Negative value if lhs appears before rhs in lexicographical order.

Is -1 sufficient for this statement to be true?

Zero if lhs and rhs compare equal.

Positive value if lhs appears after rhs in lexicographical order.

Is 1 sufficient for this statement to be true?

So the sample code is acting as per spec.

EDIT

Just test the return value for zero, less than zero or more than zero. As per spec this should work in all implementations.

EDIT 2

I think this will fulfil the spec, though I have not tested it :-(

   size_t i = 0;
   while (s1[i] && s1[i] == s2[i])
       ++i;   // skip the common prefix
   return (unsigned char)s1[i] - (unsigned char)s2[i];

This will return values other than 1, -1 or 0.

Ed Heal

Here are a few examples of C libraries with strcmp() implementations that do not always return -1, 0 or +1:

The Bionic libc has a BSD-based implementation of strcmp():

int
strcmp(const char *s1, const char *s2)
{
    while (*s1 == *s2++)
        if (*s1++ == 0)
            return (0);
    return (*(unsigned char *)s1 - *(unsigned char *)--s2);
}

Dietlibc does the same. It even uses a non-conforming version when configured with WANT_SMALL_STRING_ROUTINES, because that version subtracts plain char values, which may be signed:

int
strcmp (const char *s1, const char *s2)
{
#ifdef WANT_SMALL_STRING_ROUTINES
    while (*s1 && *s1 == *s2)
        s1++, s2++;
    return (*s1 - *s2);
#else
    // a more advanced, conforming implementation that tests multiple characters
    // at a time but still returns the difference of characters as unsigned bytes
#endif
}

Glibc has this implementation of strcmp in its generic directory, used for exotic architectures:

int
strcmp (p1, p2)
     const char *p1;
     const char *p2;
{
  register const unsigned char *s1 = (const unsigned char *) p1;
  register const unsigned char *s2 = (const unsigned char *) p2;
  unsigned reg_char c1, c2;

  do
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

Musl C library has a very compact implementation:

int strcmp(const char *l, const char *r)
{
    for (; *l==*r && *l; l++, r++);
    return *(unsigned char *)l - *(unsigned char *)r;
}

Newlib has this implementation:

int
_DEFUN (strcmp, (s1, s2),
    _CONST char *s1 _AND
    _CONST char *s2)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  while (*s1 != '\0' && *s1 == *s2)
    {
      s1++;
      s2++;
    }

  return (*(unsigned char *) s1) - (*(unsigned char *) s2);
#else
  // a more advanced approach, testing 4 bytes at a time, still returning the difference of bytes
#endif
}

Many alternative C libraries seem to follow the same pattern and return the difference of bytes, which matches the specification. But the implementations you tested seem to consistently return -1, 0 or +1. Don't rely on this. It might change in future releases, or even with the same system using different compilation flags.

chqrlie
  • Thank you for the detailed answer. Since I asked this question I have stumbled on a twist: if you use string literals as arguments the return value is always 0, 1 or -1, but if you assign them to variables and use those as arguments, the return value is the difference between the first two differing characters; see https://wandbox.org/permlink/ER7c999o1h4r7DLy. I've also tested with clang, with the same results. – anastaciu Apr 23 '20 at 20:18
  • @anastaciu: Yes, I should have hinted that your test program actually evaluates the `strcmp()` at compile time with a standardized result. But if you disable optimisations with `-O0`, you might get different results. – chqrlie Apr 23 '20 at 20:20