78

I am currently writing a C program that requires frequent comparisons of string lengths so I wrote the following helper function:

int strlonger(char *s1, char *s2) {
    return strlen(s1) - strlen(s2) > 0;
}

I have noticed that the function returns true even when s1 has shorter length than s2. Can someone please explain this strange behavior?

Kevin J. Chase
Adrian Monk
    That's a Fortran-66-ish way of writing `return strlen(s1) > strlen(s2);`. – Jonathan Leffler May 28 '12 at 00:55
  • @KeithThompson: The rewrite with `>` is what was plausibly intended; the original returns 0 if the strings are the same length and 1 if they are different. – Jonathan Leffler May 28 '12 at 03:55
  • 11
    @TimThomas: Why are you offering the bounty on this question? You say that it has not received enough attention, but it appears you are quite happy with [Alex Lockwood's answer](http://stackoverflow.com/a/10474780/623041). Not sure what more it takes to win the bounty! :) – eggyal May 28 '12 at 18:55
  • 11
    It was an accident, I didn't know what a bounty was lol. -_- Kind of embarrassing... – Adrian Monk May 29 '12 at 04:29
  • 5
    I guess it's good for Alex Lockwood though because his great answer will get more attention... so everyone **up-vote Alex Lockwood's answer!!** :D – Adrian Monk May 29 '12 at 04:30
  • 5
    I think it is better for @TimThomas to keep the bounty open until the last allowable date, so that his question gets some attention too. He unknowingly lost 100 reputation points; let him get some back. – Krishnabhadra Jun 01 '12 at 06:43
  • I guess the behavior is explained better [here](http://stackoverflow.com/questions/7221409/is-unsigned-integer-subtraction-defined-behavior). – Alexandru C. Jun 01 '12 at 19:54
  • 1
    why is the Java tag here? There's no unsigned type in Java – phuclv Dec 03 '14 at 16:50

3 Answers

177

What you've come across is some peculiar behavior that arises in C when handling expressions that contain both signed and unsigned quantities.

When an operation is performed where one operand is signed and the other is unsigned, and the unsigned type has the same or greater rank than the signed one (for example, int and unsigned int), C will implicitly convert the signed argument to unsigned and perform the operation assuming the numbers are nonnegative. This convention often leads to nonintuitive behavior for relational operators such as < and >.

Regarding your helper function, note that since strlen returns type size_t (an unsigned quantity), the difference and the comparison are both computed using unsigned arithmetic. When s1 is shorter than s2, the difference strlen(s1) - strlen(s2) should be negative, but instead becomes a large, unsigned number, which is greater than 0. Thus,

return strlen(s1) - strlen(s2) > 0;

returns 1 even if s1 is shorter than s2. To fix your function, use this code instead:

return strlen(s1) > strlen(s2);

Welcome to the wonderful world of C! :)
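
To see the two versions side by side, here is a minimal sketch. The names strlonger_buggy and strlonger_fixed are hypothetical and chosen only for this illustration, and the const qualifiers are an addition that is not in the question.

```c
#include <string.h>

/* Original version: the subtraction is done in size_t (unsigned), so it
 * wraps around instead of going negative. The result is nonzero whenever
 * the two lengths differ. */
int strlonger_buggy(const char *s1, const char *s2) {
    return strlen(s1) - strlen(s2) > 0;
}

/* Fixed version: compare the two lengths directly, with no subtraction. */
int strlonger_fixed(const char *s1, const char *s2) {
    return strlen(s1) > strlen(s2);
}
```

With s1 = "ab" and s2 = "abcd", strlonger_buggy returns 1 while strlonger_fixed returns 0. Both return 0 when the lengths are equal.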


Additional Examples

Since this question has recently received a lot of attention, I'd like to provide a few (simple) examples, just to ensure that I am getting the idea across. I will assume that we are working with a 32-bit machine using two's complement representation.

The important concept to understand when working with unsigned/signed variables in C is that if there is a mix of unsigned and signed quantities in a single expression, signed values are implicitly cast to unsigned.

Example #1:

Consider the following expression:

-1 < 0U

Since the second operand is unsigned, the first one is implicitly cast to unsigned, and hence the expression is equivalent to the comparison,

4294967295U < 0U

which of course is false. This is probably not the behavior you were expecting.
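
The comparison above can be checked with a small sketch. The function names are hypothetical. Comparing against UINT_MAX instead of the literal 4294967295U keeps the check valid even where unsigned int is not 32 bits wide.

```c
#include <limits.h>

/* Evaluates -1 < 0U. The int operand -1 is converted to unsigned int,
 * which yields UINT_MAX, so the comparison is false (0). */
int minus_one_less_than_zero_u(void) {
    return -1 < 0U;
}

/* Returns 1 if -1 converted to unsigned int equals UINT_MAX
 * (4294967295 when unsigned int is 32 bits wide). */
int minus_one_equals_uint_max(void) {
    return (unsigned int)-1 == UINT_MAX;
}
```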

Example #2:

Consider the following code that attempts to sum the elements of an array a, where the number of elements is given by parameter length:

int sum_array_elements(int a[], unsigned length) {
    int i;
    int result = 0;

    for (i = 0; i <= length-1; i++) 
        result += a[i];

    return result;
}

This function is designed to demonstrate how easily bugs can arise due to implicit casting from signed to unsigned. It seems quite natural to pass parameter length as unsigned; after all, who would ever want to use a negative length? The stopping criterion i <= length-1 also seems quite intuitive. However, when run with argument length equal to 0, the combination of these two yields an unexpected outcome.

Since parameter length is unsigned, the computation 0-1 is performed using unsigned arithmetic, which is equivalent to modular addition. The result is then UMax (the largest unsigned value, UINT_MAX). The <= comparison is also performed using an unsigned comparison, and since any number is less than or equal to UMax, the comparison always holds. Thus, the code will attempt to access invalid elements of array a.

The code can be fixed either by declaring length to be an int, or by changing the test of the for loop to be i < length.
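
A minimal sketch of the second fix (changing the loop test) might look like this. The name sum_array_elements_fixed is hypothetical, and declaring i as unsigned and a as const are additions made here so that the comparison no longer mixes signed and unsigned operands.

```c
/* Sums the first `length` elements of `a`. The test `i < length` never
 * computes length - 1, so length == 0 simply skips the loop. */
int sum_array_elements_fixed(const int a[], unsigned length) {
    unsigned i;      /* same type as length: no mixed-sign comparison */
    int result = 0;

    for (i = 0; i < length; i++)
        result += a[i];

    return result;
}
```

Called with length equal to 0, this version returns 0 without reading any element of a.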

Conclusion: When Should You Use Unsigned?

I don't want to state anything too controversial here, but here are some of the rules I often adhere to when I write programs in C.

  • DON'T use unsigned just because a number is nonnegative. It is easy to make mistakes, and these mistakes are sometimes incredibly subtle (as illustrated in Example #2).

  • DO use unsigned when performing modular arithmetic.

  • DO use unsigned when using bits to represent sets. This is often convenient because it allows you to perform logical right shifts without sign extension.

Of course, there may be situations in which you decide to go against these "rules". But more often than not, following these suggestions will make your code easier to work with and less error-prone.
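
As an illustration of the bit-set rule, here is a minimal sketch of a set of small integers stored in an unsigned int. The bitset type and the bitset_* functions are hypothetical names invented for this example, and every index k is assumed to be smaller than bitset_capacity().

```c
#include <limits.h>

/* A set of small nonnegative integers, one bit per possible member. */
typedef unsigned int bitset;

/* Number of distinct members the set can hold. */
unsigned bitset_capacity(void) {
    return (unsigned)(sizeof(bitset) * CHAR_BIT);
}

/* Returns s with member k added. Assumes k < bitset_capacity(). */
bitset bitset_add(bitset s, unsigned k) {
    return s | (1u << k);
}

/* Returns 1 if k is in s, 0 otherwise. Assumes k < bitset_capacity(). */
int bitset_contains(bitset s, unsigned k) {
    return (int)((s >> k) & 1u);
}

/* Counts the members of s. Because s is unsigned, each right shift
 * brings in a zero bit at the top, so the loop always terminates. */
int bitset_count(bitset s) {
    int n = 0;
    while (s != 0) {
        n += (int)(s & 1u);
        s >>= 1;
    }
    return n;
}
```

If the set were stored in a signed int with the top bit set, the right shift would be implementation-defined and commonly sign-extending, in which case the loop in bitset_count would never reach zero.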

Alex Lockwood
  • 47
    Another fine example how writing *less* makes the program *more* correct. – Kerrek SB May 06 '12 at 22:34
  • 1
    I have a follow up question, if that is OK. Why does C automatically cast to unsigned? I mean, I know every language has its reasons for doing things the way it does... but what exactly is the reason in this case? It seems like really weird/dangerous behavior to me! – Adrian Monk May 06 '12 at 23:08
  • 3
    @TimThomas It has to cast one way or the other, and casting unsigned to signed would lose information, i.e. half the value space. – user207421 May 07 '12 at 00:06
  • 1
    hmm... that seems somewhat obvious now that you mention it :). thanks! – Adrian Monk May 07 '12 at 00:28
  • 7
    Strictly, the subtraction is performed between two `size_t` values, which are guaranteed unsigned, and unsigned arithmetic wraps modulo the appropriate power of two. The only place where signed/unsigned conversion is possible is in the `result > 0` part, where `result` is the `size_t` value from the subtraction of the two sizes. – Jonathan Leffler May 28 '12 at 00:59
  • 1
    Ah, you are correct. `strlen(s1) - strlen(s2)` is not cast to an `unsigned int`. Rather, it has type `unsigned int` because it is the difference of two `size_t` variables. I've updated my answer to make it a bit more exact. Thanks! – Alex Lockwood May 28 '12 at 01:17
  • 9
    It doesn't *cast*, it *converts*. The term *cast* refers only to an explicit cast operator, consisting of a parenthesized type name. A cast operator explicitly specifies a conversion; a conversion may be either explicit or implicit. – Keith Thompson May 28 '12 at 02:35
  • unsigned types below int will always be converted to int if int can store their value unchanged, not the other way around. – Johannes Schaub - litb Jun 02 '12 at 07:34
  • @AlexLockwood: `strlen(s1) - strlen(s2)` has type `size_t` because both sides of `-` are `size_t`, thus no conversion is necessary. Also, `size_t` is not always `unsigned int`; I've worked on several platforms where it was `unsigned long` because `sizeof(long) > sizeof(int)` on those platforms. – Mike DeSimone Jun 02 '12 at 11:45
  • @Jonathan Leffler, Keith Thompson, and Mike DeSimone: thanks for pointing out the subtleties in my previous comments/posts. I have updated my original post to make it as concise as possible. I have also used the term *implicit casting* to refer to the "conversion" that you all referred to at some point :P. Sorry, I didn't make this clear to begin with! – Alex Lockwood Jun 02 '12 at 14:25
  • 1
    That's why decent general purpose frameworks use `int` or `long` for all integers, even if they would not make sense when negative. And then you try to interoperate with, say, C++ standard library and it's usually unsigned lengths, and you're ripping your hair out. Yes, their approach is "pure" on theoretical grounds, but wholly impractical... – Kuba hasn't forgotten Monica Jun 03 '12 at 03:13
  • 2
    I find negative integers sufficiently rare in my code that I take the opposite approach and use `unsigned int` unless there's some concrete reason not to. This has the benefit that all operations are well-defined (even "wrap-around"), though admittedly it can require care when dealing with some inequalities. – Joshua Green Jun 03 '12 at 11:48
  • 2
    This would be better suited as an answer to a question where signed/unsigned conversion matters. For this question signed/unsigned conversion is rather irrelevant. The problem here really is only that there are no negative unsigned numbers and therefore the subtraction wraps around. – sth Jun 03 '12 at 12:44
25

strlen returns a size_t which is a typedef for an unsigned type.

So,

(unsigned) 4 - (unsigned) 7 == (unsigned) - 3

All unsigned values are greater than or equal to 0. Try converting the variables returned by strlen to long int.
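
Both points can be sketched as follows. The function names are hypothetical, and signed_difference assumes that both values fit in a long int, which is an assumption of this sketch and is not guaranteed for every size_t value.

```c
#include <limits.h>
#include <stddef.h>

/* Unsigned subtraction wraps modulo UINT_MAX + 1, so 4 - 7 yields
 * UINT_MAX - 2, the same value as (unsigned)-3. */
unsigned wrapped_difference(unsigned a, unsigned b) {
    return a - b;
}

/* Converts both operands to long int before subtracting, so the result
 * can be negative. Assumes a and b both fit in a long int. */
long signed_difference(size_t a, size_t b) {
    return (long)a - (long)b;
}
```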

Alex Lockwood
pmg
  • ptrdiff_t is the correct portable cast. It's common for long int to be a 32-bit signed integer on 64-bit systems (on 64-bit systems, it's the pointers that are 64-bits). In fact, both Visual C++ and gcc for x86 and x86_64 use 32-bit longs. – Mr Fooz Jun 02 '12 at 00:32
  • 3
    I thought `ptrdiff_t` was for subtraction of pointers, not subtraction of `size_t` values... – Mr Lister Jun 02 '12 at 05:06
  • 4
    There is no POSIX type for "subtraction of `size_t` values"; C defines it as simply `size_t` since it's an integral type and the types match. You could argue that that's `off_t`, but that's actually for file offsets. So the best you'll do is reason that since `size_t` is required to hold any index the platform can handle, then it can also represent any pointer value, since it could be used to index bytes from `0`. Thus `ptrdiff_t` needs to be the same number of bits as `size_t`, making it simply the `signed` version of `size_t`. – Mike DeSimone Jun 02 '12 at 11:40
1

Alex Lockwood's answer is the best solution (compact, clear semantics, etc).

Sometimes it does make sense to explicitly convert to a signed form of size_t: ptrdiff_t, e.g.

return (ptrdiff_t)strlen(s1) - (ptrdiff_t)strlen(s2) > 0;

If you do this, you'll want to be certain that the size_t value fits in a ptrdiff_t (which typically has one fewer value bit, because one bit holds the sign).
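
One way to make that check explicit is sketched below. The name strlonger_signed is hypothetical, and falling back to a direct comparison for oversized lengths is a design choice of this sketch, not something taken from the answer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if s1 is longer than s2, 0 otherwise, using a signed
 * difference when both lengths are representable as ptrdiff_t. */
int strlonger_signed(const char *s1, const char *s2) {
    size_t n1 = strlen(s1);
    size_t n2 = strlen(s2);

    /* If either length does not fit in ptrdiff_t, avoid the conversion
     * and compare the unsigned lengths directly. */
    if (n1 > (size_t)PTRDIFF_MAX || n2 > (size_t)PTRDIFF_MAX)
        return n1 > n2;

    return (ptrdiff_t)n1 - (ptrdiff_t)n2 > 0;
}
```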

Community
Mr Fooz