2

I have a small assignment to do in C and I tried to find the best way I can to compare two strings (char arrays of course since strings are not defined in C).

This is my code :

int equal(char *s1, char *s2)
{
    int a = 0;
    while(!(a = *(unsigned char *)s1 - *(unsigned char *)s2) && *s2) ++s1,  ++s2;
    return (a == 0) ? 1 : 0;
}

It works but I don't see why I have to convert my char to an unsigned char.

(Of course I cannot use <string.h> in my assignment.)

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • 1
    Why are you using `unsigned char` in the first place? I don't think you need to... – Cornstalks Jan 30 '15 at 23:38
  • 2
    Also, why is the code so complicated? Just do one thing on each line – Mooing Duck Jan 30 '15 at 23:39
  • @Cornstalks That's his question lol –  Jan 30 '15 at 23:48
  • Why do you think you need the `(unsigned char *)` casts? The code will work exactly the same without them. The only time you need to convert to `unsigned` is if you have a possibility of overflow or values that are out of range for a particular signed type. – Chris Dodd Jan 30 '15 at 23:52
  • @user9000: I know, but my question is part of [rubber duck debugging](http://en.wikipedia.org/wiki/Rubber_duck_debugging). I'm asking why the OP did it at all in the first place, to try and understand where the flaw in his logic is. – Cornstalks Jan 30 '15 at 23:58
  • @ChrisDodd Well if I remove (unsigned char *), I just end up with an infinite loop. – GitCommit Victor B. Jan 31 '15 at 00:05
  • 1
    @TivBroc: no you don't, or at least not from the casts. If you're seeing an infinite loop, you must be doing something else different, like leaving off the parentheses around the assignment, or the `!` – Chris Dodd Jan 31 '15 at 00:11
  • @ChrisDodd Yeah I think it is when I call the function because I use : if(equal(argv[1], "factorial")). That is why ! Thanks for your answer ;) – GitCommit Victor B. Jan 31 '15 at 00:15
  • @ChrisDodd "only time you need to convert to `unsigned` is if you have a possibility of overflow or values that are out of range" does not sound correct. (unless `char` and `int` are the same size - maybe that is it) Otherwise this code's subtraction does not benefit (or not) by casting to `unsigned char`. – chux - Reinstate Monica Jan 31 '15 at 00:49
  • @chux: that's why the casts are unneeded in this code -- there's no possibility of overflow, so they make no difference. – Chris Dodd Jan 31 '15 at 01:51
  • @ChrisDodd Ah, I _do_ see the reason for the casts now: "there's no possibility of overflow` may not be so. Check my answer's #6 – chux - Reinstate Monica Jan 31 '15 at 02:06

2 Answers2

4

How about

int equal(const char *s1, const char *s2)
{
    int i;
    for (i=0; s1[i] || s2[i]; i++)
        if (s1[i] != s2[i])
            return 0;
    return 1;   
}

Or if you prefer while loops:

int equal(const char *s1, const char *s2)
{
    while (*s1 || *s2)
        if (*s1++ != *s2++)
            return 0;
    return 1;   
}

To answer your specific question, in order to compare two strings (or indeed two characters) there is no need to convert them to unsigned char. I hope you agree my method is a little more readable than yours.

abligh
  • 24,573
  • 4
  • 47
  • 84
4
  1. The original code is fairly optimal. For simple equality comparisons, there is no need for the (unsigned char *) casts. The following works fine. (but see point #6):

    int equal(char *s1, char *s2) {
      int a = 0;
      while(!(a = *s1 - *s2) && *s2) ++s1, ++s2;
      return (a == 0) ? 1 : 0;
    }
    
  2. In making more optimal code, there is no need to compare both strings for the null character '\0' as in if (*s1 || *s2) .... As code checks for a non-zero a, checking only 1 string is sufficient.

  3. "... of course since strings are not defined in C" is not so. C does define "string", though not as a type :

    "A string is a contiguous sequence of characters terminated by and including the first null character" C11 §7.1.1 1

  4. Using (unsigned char *) make sense if code is attempting to not only simply compare equality, but order. Even in this case, the type could be char. But by casting to unsigned char or even signed char, code provides consistent results across platforms even where some have char as signed char and others as unsigned char.

    // return 0, -1 or +1
    int order(const char *s1, const char *s2) {
      const unsigned char *uc1 = (const unsigned char *) s1;
      const unsigned char *uc2 = (const unsigned char *) s2;
    
      while((*uc1 == *uc2) && *uc1) ++uc1, ++uc2;
        return (*uc1 > *uc2) - (*uc1 < *uc2);
    }
    
  5. Using const in the function signature allows code to be used with const char * as order(buffer, "factorial");. Otherwise calling OP's equal(char *s1, char *s2) with equal(buffer, "factorial"); is undefined behavior. The stricken text would be true if the routine modified *s1 or *s2, but it does not. Using const does reduce certain warnings and allow for some optimizations. Credit: @abligh


  1. This is a corner case where the casting is needed. If range of char is the same as the range of int (some graphics processors do that) and char is a signed char, then *s1 - *s2 can overflow and that is undefined behavior (UB). Of course, platforms that have the same range for char and int are rare. IMO, it is doubtful even on such machines, a non-casted version of this code would fail, but it is technically UB.
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • Actually I think the original code has a bug in as does your simplification and your second example. Consider s1=`abcd` and s2=`abc` (original code and your simplification) or s1=`abc` and s2=`abcd` (second example). – abligh Jan 31 '15 at 09:19
  • Also could you cite a reference for the UB claims (not disupting - for my interest). I thought subtraction of 2 signed `char`s was defined even with overflow. If a signed char is the same length as an `int`, promotion to an `int` should also be unproblematic too. So as far as I can see there is no case where there is UB here (the subtraction may not give you what you intended but will still only give zero if the two `char`s are the same) ... – abligh Jan 31 '15 at 09:22
  • .. and re the other UB, why is *reading* from a `const char *` in the BSS cast to a `char *` UB? I thought in C (as opposed to C++) string literals were *not* `const` even though you can't write to them - see http://stackoverflow.com/questions/4493139/are-string-literals-const - am I wrong here? Obviously `const` is a good idea for optimisation of the caller. – abligh Jan 31 '15 at 09:25
  • @abligh Both OP and this answer work with testing only 1 string for length termination as the final result depend on another comparison for the final result value. – chux - Reinstate Monica Jan 31 '15 at 16:01
  • @abligh in the case of a `signed char` and `int` having the same range: OP entire question hangs on `(int) - (int)`. In a pathological situation where code is subtracting `char`s with the values `0x7FFFFFFF` and `0xFFFFFFFF` and the converted/promoted to `int`, the subtraction result arithmetically overflows and that is UB for signed integer math. In theory - I agree with the likely "the subtraction may not give you what you intended but will still only give zero if the two chars are the same", but the C spec does not provide that exception, it simply says on overflow, the result is UB. – chux - Reinstate Monica Jan 31 '15 at 16:11
  • @abligh Agree about `const` and UB - post amened. – chux - Reinstate Monica Jan 31 '15 at 16:21
  • 1
    Ah, because in C *signed* integer overflow is undefined behaviour as per C99 standard §3.4.3/1 - I agree. – abligh Jan 31 '15 at 16:44