3

I thought strcmp was supposed to return a positive number if the first string was larger than the second string. But this program

#include <stdio.h>
#include <string.h>

int main()
{
    char A[] = "A";
    char Aumlaut[] = "Ä";
    printf("%i\n", A[0]);
    printf("%i\n", Aumlaut[0]);
    printf("%i\n", strcmp(A, Aumlaut));
    return 0;
}

prints 65, -61 and -1.

Why? Is there something I'm overlooking?
I thought that maybe saving the file as UTF-8 would influence things, since the Ä consists of 2 chars there. But saving in an 8-bit encoding and making sure that both strings have length 1 doesn't help; the end result is the same.
What am I doing wrong?

Using GCC 4.3 under 32 bit Linux here, in case that matters.

Charles
Mr Lister

6 Answers

2

strcmp and the other string functions aren't actually Unicode-aware. On most POSIX machines, C/C++ char strings usually hold UTF-8 text, which makes most things "just work" for reading and writing, and leaves a library the option of understanding and manipulating the Unicode code points. But the default string.h functions are not culture-sensitive and know nothing about comparing UTF-8 strings. You can look at the source code for strcmp and see for yourself: it's about as naïve an implementation as possible (which also makes it faster than an internationalization-aware compare function).

I just answered this in another question - you need to use a UTF-aware string library such as IBM's excellent ICU - International Components for Unicode.

Community
Mahmoud Al-Qudsi
  • I realise that - that's why I said I also tried saving in another charset (Windows-1252 in this case) where `'Ä'` is one char with value -60. But that didn't help, it still prints -1. – Mr Lister May 04 '12 at 08:39
1

Saving as an 8-bit ASCII encoding, 'A' == 65 and 'Ä' equals whatever -61 is if you consider it to be an unsigned char. Anyway, 'Ä' is strictly positive and greater than 2^7-1, you're just printing it as if it were signed.

If you consider 'Ä' to be an unsigned char (which it is), its value is 195 in your charset. Hence, strcmp(65, 195) correctly reports -1.

Philip
  • Are you saying that `strcmp` treats its arguments as _unsigned chars?_ I never read anything about that. – Mr Lister May 04 '12 at 08:41
  • @MrLister: No. I'm saying that it's implementation-defined whether `char` is really `signed char` or `unsigned char`. In your case, it seems to be `unsigned char`, but you're using `%i` to print its value. Tell `printf()` that you are printing an `unsigned char` instead of a `signed int`. – Philip May 04 '12 at 08:42
  • @MrLister: That's not true. You tell printf() to consider its argument to be `signed int`, when its argument is indeed an `unsigned char`. Use the correct format specifier, which in this case is `%c`. – Philip May 04 '12 at 08:49
  • No. This: `printf("%u %u %u\n", (unsigned char)'Ä', (signed char)'Ä', (char)'Ä');` will print 196 for the unsigned char, but 4294967236 for the two others, proving that (char) has the same signage as (signed char). – Mr Lister May 04 '12 at 08:52
  • This is interesting, since casting the value to equal-sized integer types shouldn't affect the output. Besides, 4294967236 is not representable in 8 bits. – Philip May 04 '12 at 08:56
  • My guess is that your `strcmp()` implementation is smart enough to perform the comparison using `unsigned char`s. – Philip May 04 '12 at 08:57
  • That's because it _sign-extends_ the argument when printing as a 32-bit number. (Or for the pedantic, when it pushed an argument onto the stack for a variadic function.) – Mr Lister May 04 '12 at 08:59
  • If it is smart enough to do unsigned comparison on signed chars, shouldn't that be documented somewhere? I don't like surprises like that. – Mr Lister May 04 '12 at 09:01
  • True. On the other hand, I don't know of any charset with negative values. – Philip May 04 '12 at 09:02
  • @MrLister: see http://fossies.org/dox/glibc-2.15/strcmp_8c_source.html (glibc's `strcmp()` implementation). It indeed casts to `unsigned char` before comparison. – Philip May 04 '12 at 09:05
  • That's beside the point, because it doesn't matter what charset is used. You could write `'\xC4'` and have the same anomalies, on any system. – Mr Lister May 04 '12 at 09:06
  • ...which is still a miracle to me. `(unsigned char)'Ä'`, `(signed char)'Ä'` and `(char)'Ä'` all have the same bit pattern on a 2's complement machine. When printf() performs sign-extension on all three `%u` arguments, the result should always be the same. Note that printf() doesn't know about the casts... WTF? (: – Philip May 04 '12 at 09:12
  • I meant my last comment as a reply to "charset with negative values". The fossies page is news to me. – Mr Lister May 04 '12 at 09:13
  • @Philip Sign-extension only happens for negative values. What happens are the _integer promotions_. When the value of a promoted type can be represented in the target type, the promotion is value-preserving. In Windows-1252, 'Ä' is 196 as an unsigned char, and -60 as a signed char. Both values are representable as `int`, so the call is `printf("%u %u %u\n",196,-60,-60);`. Although strictly, it is undefined behaviour - `%u` requires an `unsigned int` -, you can pretty much rely on the bit-pattern of the `int`s just being interpreted as `unsigned int`s. – Daniel Fischer May 04 '12 at 09:27
1

strcmp() takes chars as unsigned ASCII values. So, your A-with-double-dots isn't char -61, it's char 195 (or maybe 196, if I've got my math wrong).

mjfgates
  • That's what it appears to do, yes. But why? – Mr Lister May 04 '12 at 08:55
  • @MrLister In an 8-bit encoding like iso-8859-1 or Windows-1252, the code points are numbered 0-255. Treating the contents of the strings as `unsigned char` preserves the ordering of the code points, treating them as signed doesn't. Similarly, with an encoding like utf-8, a higher unicode code-point number yields a lexicographically larger byte sequence when considering the bytes as unsigned, but not when considering them as signed. Probably that's the reason why `strcmp` uses `unsigned char`s. – Daniel Fischer May 04 '12 at 09:38
  • @DanielFischer Makes sense. So you mean it's not even implementation dependent? Oh well, I suppose I can live with that, but I'd really have appreciated that if the manual would have said so. – Mr Lister May 04 '12 at 09:43
  • @MrLister No, it's standard-mandated, see my answer. I agree it would be nice if the man-page said so. – Daniel Fischer May 04 '12 at 09:50
1

The strcmp and similar comparison functions treat the bytes in the strings as unsigned chars, as specified by the standard in section 7.24.4, point 1 (was 7.21.4 in C99):

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

(emphasis mine).

The reason is probably that such an interpretation maintains the ordering between code points in the common encodings, while interpreting them as signed chars doesn't.

Daniel Fischer
  • More significantly, if one string has a zero byte in a certain position while another string has something else, the first string should compare before the second, even if that something else, interpreted as a `char`, would be negative. One could have a special rule to define the ranking as 0, then -128 to -1, then 1 to 127, but that would be a bit weird. – supercat Jan 30 '13 at 22:45
0

Check the strcmp manpage:

The strcmp() function compares the two strings s1 and s2. It returns
an integer less than, equal to, or greater than zero if s1 is found,
respectively, to be less than, to match, or be greater than s2.
ott--
  • But it doesn't say that -60 is greater than 65. Which is why I asked the question. – Mr Lister May 04 '12 at 08:53
  • It says -1 because the string "A" is less than "Ä". You see the -61 because you print only the first byte of the "Ä" string. – ott-- May 04 '12 at 09:21
-1

To do string handling correctly in C when the input goes beyond ASCII (for example UTF-8 encoded text), you should use the standard library's wide-character facilities for strings and I/O. Your program should be:

#include <wchar.h>
#include <stdio.h>

int main()
{
    wchar_t A[] = L"A";
    wchar_t Aumlaut[] = L"Ä";
    /* Cast for %i: wchar_t is not int on every platform. */
    wprintf(L"%i\n", (int)A[0]);
    wprintf(L"%i\n", (int)Aumlaut[0]);
    wprintf(L"%i\n", wcscmp(A, Aumlaut));
    return 0;
}

and then it will give the correct results (GCC 4.6.3). You don't need a special library.

Mike Kinghan