Is the strcasecmp algorithm flawed?

Question

I am trying to reimplement the strcasecmp function in C and I noticed what appears to be an inconsistency in the comparison process.

From man strcmp

The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)). It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

From man strcasecmp

The strcasecmp() function performs a byte-by-byte comparison of the strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);

Given, this information, I don't understand the result of the following code:

#include <stdio.h>
#include <string.h>

int main()
{
    // ASCII values
    // 'A' = 65
    // '_' = 95
    // 'a' = 97

    printf("%i\n", strcmp("A", "_"));
    printf("%i\n", strcmp("a", "_"));
    printf("%i\n", strcasecmp("A", "_"));
    printf("%i\n", strcasecmp("a", "_"));
    return 0;
}

Ouput:

-1  # "A" is less than "_"
1   # "a" is more than "_"
2   # "A" is more than "_" with strcasecmp ???
2   # "a" is more than "_" with strcasecmp

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

Can someone explain this behaviour? Shouldn't the first and third lines be identical?

Thank you in advance!

PS:
I am using gcc 9.2.0 on Manjaro.
Also, when I compile with the -fno-builtin flag I get instead:

I guess it's because the program does not use gcc's optimised functions, but the question remains.

Add another test case to your set: `printf("%i\n", strcasecmp("a", "_"));` This should presumably have the same result as `printf("%i\n", strcasecmp("A", "_"));` But that means that _one_ of these two case-insensitive calls is going to disagree with its case-sensitive counterpart. — anton.burger, Feb 21 '20 at 16:15
It seems the description of `strcasecmp` you're refering to is not accurate. More details in the upvoted answers. — Jabberwocky, Feb 21 '20 at 16:58
It's the only thing that makes sense. A function that says `A < _ && a > _ && A == a` would cause so many problems. — ikegami, Feb 21 '20 at 17:25
Aside: "I am trying to reimplement the strcasecmp function in C " --> Although code not shown, be sure to compare "as if"`unsigned char`. C17/18 "String handling " --> "For all functions in this subclause, each character shall be interpreted as if it had the type `unsigned char`". This makes a difference once `char` values are outside ASCII range 0-127. — chux - Reinstate Monica, Feb 21 '20 at 17:37
@chux-ReinstateMonica Thanks for the tip. I actutally want to implement it in assembly, but I didn't mention it to avoid confusion, as it wasn't necessary. — Haltarys, Feb 21 '20 at 18:14
writing this in assembly is a bad idea. You won't be able to beat optimized library functions with SIMD or other special techniques — phuclv, Feb 22 '20 at 14:10
On the differences in the outputs with built-ins and without: Both say the same, as their results are identically <0 and >0, and you don't have an example for ==0. But you can see the algorithms shine through: some of the returned values are the differences of the first non-equal character. — the busybee, Feb 22 '20 at 15:38

Andrew Henle · Accepted Answer · 2020-02-23T15:22:17.593

29

The behavior is correct.

Per the POSIX str\[n\]casecmp() specification:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

That is also part of the NOTES section of the Linux man page:

The POSIX.1-2008 standard says of these functions:

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Why?

As @HansOlsson pointed out in his answer, doing case-insensitive comparisons between only letters and allowing all other comparisons to have their "natural" results as done in strcmp() would break sorting.

If 'A' == 'a' (the definition of a case-insensitive comparison) then '_' > 'A' and '_' < 'a' (the "natural" results in the ASCII character set) can not both be true.

edited Feb 23 '20 at 15:22

answered Feb 21 '20 at 16:19

Andrew Henle

32,625
3
24
56

Doing case-insensitive comparisons between only letters would not result in `'_' > 'A' && '_' < 'a'`; doesn't seem like the best example. – Asteroids With Wings Feb 22 '20 at 14:44
1

@AsteroidsWithWings Those are the characters used in the question. And if `'a' == 'A'` **by definition**, if you do a comparison between the "natural" values of `'a'`, `'A'`, and `'_`', you **can not** do a case-insensitive comparison between `'A'` and `'a'` to get equality and get consistent sort results. – Andrew Henle Feb 22 '20 at 18:54
I'm not disputing that, but the specific counter-example you provided doesn't seem to be relevant. – Asteroids With Wings Feb 22 '20 at 19:27
@AsteroidsWithWings Go through the mental exercise of building a binary tree from `'a'`, `'A'`, and `'_'`, going through all 6 orders of insertion into the tree, and comparing the results from the as-specified "always lowercase letters" to the question's proposed "only convert letters when it's a letter-to-letter comparison". For example, using the latter algorithm and starting with `'_'`, `'a'` and `'A'` wind up on opposite sides of the tree yet they're defined as equal. The "only convert letters to lower case in letter-letter comparisons" algorithm is **broken** and those 3 chars show that. – Andrew Henle Feb 23 '20 at 12:49
Okay, then I suggest demonstrating that in the answer because at the moment it just jumps to pointing out that _"`'_' > 'A' ` and `'_' < 'a'` can not both be true"_ without telling us why we should ever have thought it would be. (That's a task for the answerer, not for one of millions of readers.) – Asteroids With Wings Feb 23 '20 at 15:07
@AsteroidsWithWings The entire sentence states "**If `'A' == 'a'` (the definition of a case-insensitive comparison)** then `'_' > 'A'` and `'_' < 'a'` can not both be true." – Andrew Henle Feb 23 '20 at 15:23
I know what the sentence says... I can read... that did not address my comment at all. Alright, just forget it. Trying to help you improve your answer but I can only lead a horse to water! Have a good one. – Asteroids With Wings Feb 23 '20 at 17:02
@AsteroidsWithWings *I know what the sentence says...* Then don't selectively misquote something that no one else has a problem with in a way that implies something clearly stated was left out. No one is preventing you from posting what you think is a better answer. – Andrew Henle Feb 23 '20 at 17:44
I never claimed that that first part of the sentence was left out. I'm saying it's insufficient for the purpose of establishing why the second part is useful/interesting/relevant. Again, I'm not saying it isn't, just that your answer as currently written _does not explain that_. I'm sorry that you did not like my feedback. Furthermore, what on earth does this have to do with me writing a hypothetical additional answer? Too many non sequiturs. – Asteroids With Wings Feb 23 '20 at 18:29

score 22 · Answer 2 · answered Feb 21 '20 at 16:19

22

Other links, http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp say that converting to lower-case is the correct behavior (at least in POSIX locale).

The reason for that behavior is that if you use strcasecmp to sort an array of strings it is needed to get reasonable results.

Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.

answered Feb 21 '20 at 16:19

Hans Olsson

11,123
15
38

3

*Otherwise if you try to sort "A", "C", "_", "b" using e.g., qsort the result would depend on the order of comparisons.* Good point. That's likely the reason POSIX specifies the behavior. – Andrew Henle Feb 21 '20 at 16:22
6

More concretely, you need a [total order](https://en.wikipedia.org/wiki/Total_order) for sorting, which would not be the case if you define the comparison as in the question (since it wouldn't be transitive). – Bernhard Barker Feb 22 '20 at 17:57

Adrian Mole · Answer 3 · 2020-02-21T16:28:47.157

It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.

That's correct - and it's what the strcasecmp() function should do! It is a POSIX function, rather than part of the C Standard but, from the "The Open Group Base Specifications, Issue 6":

In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.

Incidentally, this behaviour is also pertintent to the _stricmp() function (as used in Visual Studio/MSCV):

The _stricmp function ordinally compares string1 and string2 after converting each character to lowercase, and returns a value indicating their relationship.

anastaciu · Answer 4 · 2020-02-22T18:30:34.137

2

The ASCII decimal code for A is 65 for _ is 95 and for a is 97, so strcmp() it's doing what it's suppose to do. Lexicographically speaking _ is smaller then a and bigger than A.

strcasecmp() will regard A as being a*, and since a is bigger than _ the output is also correct.

*The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()):

When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

Source: http://man7.org/linux/man-pages/man3/strcasecmp.3.html

edited Feb 22 '20 at 18:30

answered Feb 21 '20 at 16:17

anastaciu

23,467
7
28
53

3

OP's point is that `A` is "bigger" than `_` when comparing case-insensitively, and wonders why the result is not the same as when comparing case-sensitively. – anton.burger Feb 21 '20 at 16:21
6

The statement `Since `strcasecmp()` is case insensitive it will regard A as being a` is an invalid deduction. A case-insensitive routine could treat all uppercase letters as if they were lowercase letters, could treat all lowercase letters as if they were uppercase letters, or could treat each uppercase letter as equal to its corresponding lowercase letter and vice-versa but still compare them to non-letter characters with their raw values. This answer does not state a reason for preferring any of those possibilities (the correct reason for which is the documentation says to use lowercase). – Eric Postpischil Feb 21 '20 at 17:20
@EricPostpischil The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()): When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified. – anastaciu Feb 21 '20 at 17:26

Is the strcasecmp algorithm flawed?

4 Answers4