
As far as I know, the only difference between variable types such as char and int is the amount of memory they occupy; I assume they play no role in determining what the value they hold represents. With that in mind, I came across the following description of strcmp:

The strcmp function compares the string s1 against s2, returning a value that has the same sign as the difference between the first differing pair of characters (interpreted as unsigned char objects, then promoted to int).

I want to ask: why is the result promoted to int? Since chars are being compared, their difference fits into a char in all cases. So isn't promoting the result to int simply appending a bunch of 0's to the result? Why is this done?

Utku
  • Why would it return `char`? It returns the difference between the strings. – Iharob Al Asimi May 27 '15 at 16:10
  • Yes but doesn't it stop upon seeing the first non-matching character? Then the result must be fitting into a char. Is that wrong? – Utku May 27 '15 at 16:12
  • What if the difference is 255? – Random832 May 27 '15 at 16:12
  • See http://stackoverflow.com/questions/163254/on-32-bit-cpus-is-an-integer-type-more-efficient-than-a-short-type – wonko realtime May 27 '15 at 16:15
  • The difference between two 8-bit chars will be between -255 and 255, and that range doesn't fit in a char. – interjay May 27 '15 at 16:17
  • @interjay only the sign is the same (which probably was formulated this way in the hope that it would make explanation easier) but not more. strcmp only defines three results, where one is strictly lower than zero (and most of the times -1 maybe, but might change all the time afaik), one is exactly zero and the last one always higher than zero. See [here](http://www.cplusplus.com/reference/cstring/strcmp/) for an alternative description. – wonko realtime May 27 '15 at 16:29
  • @wonkorealtime I know that. I was only pointing out that the OP is wrong in assuming that the difference between two chars can be stored in a char. – interjay May 27 '15 at 16:30
  • Really the weirdest question I have ever seen... strcmp() is supposed to return only -1, 0, 1... which looks pretty convenient to almost all developers, and that's the point of this function. Seeing the first non-matching character won't even return each string's precedence. What's the use? – Nick Song May 28 '15 at 02:25
  • "Since chars are being compared, their difference fits into a char in all cases" - this is not true for signed chars. For example the difference between `127` and `-2` is `129` which doesn't fit into a signed char. Nor can the return value be `unsigned char` because they can't be negative. – M.M Jun 01 '15 at 02:48

4 Answers


char may or may not be signed. strcmp must return a signed type, so that it can be negative if the difference is negative.

More generally, int is preferred for passing and returning simple numerical values, since it's defined as the "natural" size for such values and, on some platforms, is more efficient to deal with than smaller types.
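
To make the range argument concrete, here is a minimal sketch of a strcmp-style comparison. The name `sketch_strcmp` is hypothetical and this is not taken from any particular C library; it only illustrates why the difference of two `unsigned char` values needs a signed type wider than `char`. It assumes a platform where `int` is wider than `char`, which is the common case.

```c
/* Hypothetical illustration, not a real library implementation.
   Compares two strings byte by byte, reading each byte as unsigned char. */
int sketch_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance while the bytes match and the end of s1 is not reached. */
    while (*p1 != '\0' && *p1 == *p2) {
        p1++;
        p2++;
    }

    /* Both operands are promoted to int before the subtraction.
       With 8-bit chars the result lies between -255 and 255,
       a range that neither signed char nor unsigned char can hold. */
    return *p1 - *p2;
}
```

With 8-bit chars, `sketch_strcmp("\xff", "")` yields 255 and `sketch_strcmp("", "\xff")` yields -255, so both extremes really occur.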

Mike Seymour
  • On the requirement to represent negative values: I guess that the function could have returned, say, 0 for negative, 1 for equal and 2 for greater, since as far as I understand, the function does not guarantee that the return value represents the difference between the first pair of non-matching characters. If that's the case, the function could have returned char, right? So I wonder whether there is a more fundamental reason. – Utku May 27 '15 at 16:41
  • @Utku: Yes, it could have encoded the result in a different way and not required a signed type. But it would probably have returned `int` anyway, since that's the natural size for a function return value. And the encoding would probably be less efficient than the obvious implementation of `if (c1 != c2) return c1 - c2;` – Mike Seymour May 27 '15 at 16:44
  • @Utku In mathematical terms `a < b implies strcmp(a,b) < 0` is more sound than `a < b implies strcmp(a,b) = 1` – UmNyobe May 27 '15 at 16:47
  • @UmNyobe Of course, but I speculated that it might have been done for the purpose of saving memory, unless there is another reason not to. – Utku May 27 '15 at 16:50
  • @MikeSeymour But why is using a "natural size" for return values so important? Could you point out a resource on the importance of using the natural size for function return values? – Utku May 27 '15 at 17:01
  • @Utku It's not particularly important, just the natural thing to do given a type whose size has been chosen to be efficient to pass around. – Mike Seymour May 27 '15 at 17:31
  • A "natural size" `int` is a consequence of CPU design (see http://stackoverflow.com/questions/2331751/does-the-size-of-an-int-depend-on-the-compiler-and-or-processor). One aspect of CPU design consideration generally respected by compiler writers is "word alignment" (see http://stackoverflow.com/questions/381244/purpose-of-memory-alignment). This consideration of "word alignment" would then guide the implementation of "stack frames" (see http://stackoverflow.com/questions/10057443/explain-the-concept-of-a-stack-frame-in-a-nutshell). – rskar May 27 '15 at 17:37
  • Hence, for the sake of CPU efficiency and reliable compilation (not to mention skirting any chance of an unintended promotion or non-promotion, per platform implementation or compiler settings), the library writer(s) explicitly made the result returned into an `int`. That would be my guess, anyway. – rskar May 27 '15 at 17:37

Of course, despite the overflow possibility others have mentioned, it only needs to be able to return e.g. -1, 0, or 1 - which easily fit in a signed char. The real historical reason for this is that in the original version of C in the 1970s, functions couldn't return a char, and any attempt to do so resulted in returning an int.

In these early compilers, int was also the default type (many situations, including function return values as seen in main below, allowed you to declare something as int without actually using the int keyword), so it made sense to define any function that didn't specifically need to return a different type as returning an int.

Even now, a char return simply sign-extends the value into the int return register (r0 on pdp11, eax on x86) anyway. Treating it as a char would not have any performance benefit, whereas allowing it to be the actual difference rather than forcing it to be -1 or 1 did have a small performance benefit. And axiac's answer also makes the good point that it would have had to be promoted back to an int anyway, for the comparison operator. The reason for these promotions is also historical, incidentally: the compiler then did not have to implement separate operators for every possible combination of char and int, especially since the comparison instructions on many processors only work with an int anyway.
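
The cost of squeezing the result into a char can be sketched as follows. Both helper names are hypothetical and exist only for illustration. The point is that a raw difference cannot simply be narrowed; it first has to be mapped to -1, 0 or 1, which is extra work a function returning `int` can skip.

```c
/* Hypothetical helpers for illustration only. */

/* Maps any difference to -1, 0 or 1, which always fits in a signed char.
   This is the extra step a char-returning comparison would need. */
signed char clamp_to_sign(int difference)
{
    return (signed char)((difference > 0) - (difference < 0));
}

/* Naively narrows the difference without clamping. For values outside
   the range of signed char the result is implementation-defined; on
   common two's-complement platforms with 8-bit char, 200 becomes -56,
   so a "greater than" result would be reported as "less than". */
signed char narrow_naively(int difference)
{
    return (signed char)difference;
}
```

A caller that only checks the sign gets the same answer from the raw `int` difference as from `clamp_to_sign`, without the additional comparisons.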


Proof: If I make a test program on Unix V6 for PDP-11, the char type is silently ignored and an integer value outside the range is returned:

char foo() {
    return 257;
}

main() {
    printf("%d\n", foo());
    return 0;
}

# cc foo.c
# a.out
257
Random832
  • ... Why do you have a PDP-11 available to test stuff on? – user253751 May 27 '15 at 22:24
  • @immibis It's emulated, of course. And V7 does handle it properly; but my point stands that it costs extra instructions to convert it, and still more to make sure out-of-range values are handled correctly, and that there's no real benefit to returning a char. – Random832 May 27 '15 at 22:33

AFAIK, the standard C library doesn't have a single function that takes or returns values of type char. It has arguments and return types of type char* or const char* but not plain char.

Look for example at int isalpha(int c); for a more shocking instance.

I don't know why, but I can guess. Maybe it is due to the ABI. In any ABI I know, any argument or return value of type char is internally promoted to int anyway, so there is no point in using char. It would actually make the code less efficient, as you would need to truncate the value each time the function is used.
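
One practical consequence of the `int isalpha(int c)` signature is sketched below. The helper name `count_alpha` is hypothetical. The argument must be either `EOF` or a value representable as `unsigned char`, so a plain `char` taken from a string is converted first.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the alphabetic characters in a string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Convert to unsigned char first: if plain char is signed, a
           byte such as 0xF1 is negative, and passing a negative value
           other than EOF to isalpha is undefined behaviour. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

A value that comes straight from `getchar()` is already in the right form and needs no cast, as long as it is kept in an `int`.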

rodrigo
  • @interjay: That may be for `getchar`, but for `isalpha` is useless. It looks like they said: "look, we can accept EOF here and type `isalpha(getchar())`". – rodrigo May 27 '15 at 16:22
  • It's actually worse than useless, because it makes passing a `char` to `isalpha` wrong (you need to cast it to `unsigned char` first). – interjay May 27 '15 at 16:36
  • @interjay: You only need to cast it if it is a char-as-char. If you have a proper char-as-integer, then it is fine: `char x=getchar(); isalpha(x);` wrong! `int x=getchar(); isalpha(x);` ok! Curiously, `isalpha('ñ')` is ok in C but undefined in C++! – rodrigo May 27 '15 at 17:34
  • @rodrigo I don't think that's true - `'ñ'` is an int in C, yes, but it will still have the same negative value. – Random832 May 27 '15 at 18:44
  • @Random832: Hmmm, I've just checked and you are right. Now I wonder why using GCC and Latin-1 encoding,`'ñ'` is `-15` and not `241`. – rodrigo May 27 '15 at 18:47
  • Also, for isalpha, it originally had isascii as a prerequisite (i.e. you needed to check `isascii(x) && isalpha(x)`), in the pre-ANSI days. – Random832 May 27 '15 at 18:47
  • @Random832: rodrigo was right the first time. `getchar()` returns either the negative value `EOF` (typically `-1`) **or** the next input character *as an `unsigned char` converted to an `int`*. So assuming Latin-1, `getchar()` will return 241 for `'ñ'`. Storing that value in a plain `char` (if plain `char` is signed) will convert the `241` to `(char)-15`. – Keith Thompson Jun 01 '15 at 03:05
  • @KeithThompson However, his allusion to it being different in C++ made it clear he was talking about using the _character literal_ `'ñ'` (which is a char in C++, but an int [but still with the value of a plain char] in C), not a char variable as a result of getchar or whatever. – Random832 Jun 01 '15 at 03:07
  • @KeithThompson: Random832 is right. I was mistaken about the value of the character literal. Now I don't understand why `getchar()` reading a Latin-1 `'ñ'` returns `(int)241` but the literal `'ñ'` is `(int)-15`. It is madness! – rodrigo Jun 01 '15 at 07:53
  • @rodrigo: In C, a character constant is of type `int`; the value is that of the corresponding `char` value. If plain `char` is signed, certain character constants can be negative. In C++, the value of `'ñ'` is also negative (again assuming plain `char` is signed); the difference is that its type is `char`. – Keith Thompson Jun 01 '15 at 14:58
  • @KeithThompson: Yeah, I know that now. And `getchar()` is defined as reading "_the next byte as an **unsigned char** converted to an int_" so everything is perfectly defined. I just find surprising that I have to write `(unsigned char)'ñ'` to get the proper value of a character. – rodrigo Jun 01 '15 at 16:38
  • @rodrigo: That's assuming you're using Latin-1. With my setup, for example, I'm not even sure how to enter a Latin-1 `'ñ'` character in a text file; if I type or copy-and-paste it, I get a UTF-8 2-byte sequence. – Keith Thompson Jun 01 '15 at 17:14

One possible reason why strcmp() promotes the values it returns to int is to spare a processor instruction in the calling code.

Usually (always?) the value returned by strcmp() is used with a comparison operator.

Let's see what happens with the operands of comparison operators.

Usual arithmetic conversions

The arguments of the following arithmetic operators undergo implicit conversions for the purpose of obtaining the common real type, which is the type in which the calculation is performed:

  • binary arithmetic *, /, %, +, -
  • relational operators <, >, <=, >=, ==, !=
  • binary bitwise arithmetic &, ^, |
  • the conditional operator ?:

...

4) Otherwise, both operands are integers. In that case,

First of all, both operands undergo integer promotions.
...

(source: http://en.cppreference.com/w/c/language/conversion#Usual_arithmetic_conversions)

Integer promotions

Integer promotion is the implicit conversion of a value of any integer type with rank less or equal to rank of int or of a bit field of type _Bool, int, signed int, unsigned int, to the value of type int or unsigned int.

(source: http://en.cppreference.com/w/c/language/conversion#Integer_promotions)

Back to strcmp()

As you can see from the quotes above, a possible char value returned by strcmp() is promoted to int anyway.
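
The promotion described in the quotes can be observed directly. The sketch below requires C11 for `_Generic`; the macro and function names are hypothetical. It assumes a typical platform where `int` can represent every `char` value, so that `char` promotes to `int` rather than `unsigned int`.

```c
/* Requires C11 for _Generic. All names here are hypothetical. */

/* Evaluates to 1 if the expression has type int, 0 otherwise. */
#define IS_INT(expr) _Generic((expr), int: 1, default: 0)

/* A plain char object keeps its own type until it is used as an operand. */
int plain_char_is_int(void)
{
    char c = 'x';
    return IS_INT(c);
}

/* Both operands undergo integer promotion, so the difference is an int. */
int char_difference_is_int(void)
{
    char a = 'a', b = 'b';
    return IS_INT(a - b);
}

/* Unary plus applies the same integer promotion that the operands of a
   relational operator receive. */
int promoted_char_is_int(void)
{
    char c = 'x';
    return IS_INT(+c);
}
```

On such a platform the first function returns 0 and the other two return 1, so a `char` result compared against 0 would be widened to `int` at every call site.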

Why did the creators of C choose to return an int?

For a very simple reason: because the promotion is going to happen anyway and because (at least) one processor instruction is needed to perform the promotion, it's more convenient to add that instruction to the code of strcmp() (i.e. in a single place) than everywhere the strcmp() function is called.

Back in the 70s both the memory and the CPU were very valuable resources. An optimization that now seems insignificant (a couple of bytes of memory saved here and there, maybe in several dozen places in the code) had much more importance back then.

Update:

On second thought, I think the historical reasons provided by two of the other answers are more accurate than mine.

axiac
  • So, the reason is the fact that arithmetic operators always operate on int, and any other type that is an operand of an arithmetic operator is _always_ implicitly converted to int? – Utku May 27 '15 at 16:48
  • @Utku Not _always_ per se, only when it's char or short. – Random832 May 27 '15 at 18:42