12

According to the standard:

The values of the members of the execution character set are implementation-defined.
(ISO/IEC 9899:1999 5.2.1/1)

Further in the standard:

...the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
(ISO/IEC 9899:1999 5.2.1/3)

It appears that the standard requires that the execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, but I see no requirement that these characters be ordered in any way. I only see an order stipulation for the decimal digits.

This would seem to imply that, strictly speaking, there is no guarantee that 'a' < 'b'. Now, the letters of the alphabet are in order in each of ASCII, UTF-8, and EBCDIC. But for ASCII and UTF-8 we have 'A' < 'a', while for EBCDIC we have 'a' < 'A'.

It might be nice to have a function in ctype.h that compares alphabetic characters portably. Short of this or something similar, it seems to me that one must look in the locale to find the value of CODESET and proceed accordingly, but this doesn't seem simple.

My gut tells me that this is almost never an issue; for most cases alphabetical characters can be handled by converting to lowercase, because for the most commonly used character sets the letters are in order.

The question: given two chars

char c1;
char c2;

is there a simple, portable way to determine if c1 precedes c2 alphabetically? Or do we assume that the lowercase and uppercase characters always occur in sequence, even though this does not appear to be guaranteed by the standard?

To clarify any confusion, I am really just interested in the 52 letters of the Latin alphabet that are guaranteed by the standard to be in the execution character set. I realize that other sets of letters are important, but it seems that we can't even know about the ordering of this small subset of letters.

Edit

I think that I need to clarify a bit more. The issue, as I see it, is that we commonly think of the 26 lowercase letters of the Latin alphabet as being ordered. I would like to be able to assert that 'a' comes before 'b', and we have a convenient way of expressing this in code as 'a' < 'b', when we give 'a' and 'b' integral values. But the standard gives no assurances that the above code will evaluate as expected. Why not? The standard does guarantee this behavior for the digits 0-9, and this seems sensible. If I want to determine if one letter-char precedes another, say for sorting purposes, and if I want this code to be truly portable, it seems like the standard offers no help. Now I have to rely on the convention that ASCII, UTF-8, EBCDIC, etc. have adopted that 'a' < 'b' should be true. But this isn't really portable unless the only character sets used rely on this convention; this may be true.

This question originated for me in another question thread: Check if a letter is before or after another letter in C. Here, a few people suggested that you could determine the order of two letters stored in chars using inequalities. But one commenter pointed out that this behavior is not guaranteed by the standard.

Community
  • 1
  • 1
ad absurdum
  • 19,498
  • 5
  • 37
  • 60
  • 3
    [This answer](https://stackoverflow.com/questions/1469711/converting-letters-to-numbers-in-c/35642655#35642655) may help. – user3386109 Oct 07 '16 at 19:27
  • 1
    the standard doesn't say anything about what order the "alphabet" has to be in, but given that there's literally millions of C programs out there written with direct `if (var < 'a')`-type comparisons, you have to pretty much assume that all charsets used will have their characters listed in ascending order. – Marc B Oct 07 '16 at 19:27
  • If you make @MarcB assumption in your program make sure you check it using an assert. – Jorge Bellon Oct 07 '16 at 19:28
  • @MarcB Not to mention letter-to-number conversions like `letterVal = letter - 'A'` – dbush Oct 07 '16 at 19:29
  • 6
    What is the repertoire of characters you care about? Plain old ISO-646? Polish? Korean? Japanese? Devanagari ? – bmargulies Oct 07 '16 at 19:36
  • ASCII characters? Sure, throw them into an array. Unicode characters? Not really, unless you want to talk to the unicode consortium about that. – Qix - MONICA WAS MISTREATED Oct 07 '16 at 19:39
  • Start your translation unit with `_Static_assert('a' < 'b'); _Static_assert('b' < 'c'); _Static_assert('c' < 'd');` etcetera. It's only two times 25 lines in a header file. –  Oct 07 '16 at 19:48
  • I do not think anything both simple *and* portable is possible... – paulotorrens Oct 07 '16 at 19:59
  • 2
    @bmargulies-- I suppose that I am really just interested in the 52 letters of the Latin alphabet that the standard guarantees must be in the execution character set. I will edit my question to clarify this. – ad absurdum Oct 07 '16 at 20:01
  • @DavidBowling So is it "ABC...acb..." or "acb...ABC..." or AaBbCc..." or "aAbBcC..." or ? – chux - Reinstate Monica Oct 07 '16 at 20:04
  • 1
    Curious, so if you find a satisfactory way to order 2 `char`, how is that to be used? What is the next higher-level goal? – chux - Reinstate Monica Oct 07 '16 at 20:39
  • @chux-- I don't really have a "higher-level goal." I added an edit to my question in an attempt to clarify my thinking and motivations. – ad absurdum Oct 07 '16 at 21:05
  • Agree with @chux in that without a reason for a specific ordering, it doesn't seem to make much difference. It only seems to matter if a particular order is consistent for a particular environment. – user2338816 Oct 08 '16 at 04:17
  • @user2338816-- the issue is one of portability. A simple comparison of the integer values of the `char`s in question is not guaranteed by the standard to work, and is not strictly portable, though it is common. I wondered if there is a simple, portable solution. The answers and comments I received gave a few solutions, some pretty simple, others less so. – ad absurdum Oct 08 '16 at 04:34

5 Answers

10

strcoll is designed for this purpose: it compares two strings according to the LC_COLLATE category of the current locale. Simply set up two strings of one character each (normally you want to compare strings, not characters).
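
A minimal sketch of that idea, assuming a hypothetical helper name compare_chars_collated (it is not part of any library). The result follows the collation order of whatever locale is current, so in the default "C" locale it will typically match a plain strcmp; call setlocale(LC_COLLATE, "") first if the user's locale should decide.

```c
#include <string.h>

/* Hypothetical helper: returns a negative value if c1 collates before c2,
   zero if they collate equal, and a positive value if c1 collates after c2,
   all according to the current locale's LC_COLLATE category. */
static int compare_chars_collated(char c1, char c2)
{
    /* strcoll works on strings, so wrap each char in a 1-character string. */
    const char s1[2] = { c1, '\0' };
    const char s2[2] = { c2, '\0' };
    return strcoll(s1, s2);
}
```

With this, "c1 precedes c2" becomes compare_chars_collated(c1, c2) < 0.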

Malcolm McLean
  • 6,258
  • 1
  • 17
  • 18
  • Nice answer using the standard - UV. To compare `char`, could use the compound literal trick of my answer. – chux - Reinstate Monica Oct 07 '16 at 20:01
  • 1
    `strcoll` is the obvious solution, but it depends on having the appropriate locale setting. The standard doesn't specify any locales other than `"C"` and `""` (which might be the same). – Keith Thompson Oct 07 '16 at 20:36
  • Of course things depend on the appropriate locale setting. Both German and Swedish consider "Ö" a character. Swedish sorts "Ö" after "Z", German sorts "Ö" like "Oe". So, yes, you need the appropriate locale to sort them *portably*... – DevSolar Aug 01 '18 at 13:01
6

There are historically used codes that don't simply order the alphabet. Baudot, for example, puts vowels before consonants, so 'A' < 'B', but 'U' < 'B' as well.

There are also codes like EBCDIC that are ordered, but with gaps. So in EBCDIC, 'I' < 'J', but 'I' + 1 != 'J'.
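
Because of such gaps, letter - 'A' is not a portable way to get a letter's position. One encoding-independent sketch is to look the character up in a string literal that lists the alphabet in order, since the compiler translates the literal into the execution character set. The name letter_rank is hypothetical and used only for illustration.

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for 'A'..'Z' or 'a'..'z'
   (case-insensitive), or -1 if c is not one of those 52 letters.
   It relies only on the order of characters inside the string literals,
   not on their numeric values. */
static int letter_rank(char c)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')
        return -1;              /* strchr would match the terminator */
    if ((p = strchr(upper, c)) != NULL)
        return (int)(p - upper);
    if ((p = strchr(lower, c)) != NULL)
        return (int)(p - lower);
    return -1;
}
```

Two letters can then be ordered by comparing letter_rank(c1) with letter_rank(c2), even where 'I' + 1 != 'J'.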

Lee Daniel Crocker
  • 12,927
  • 3
  • 29
  • 55
  • 2
    Does any compiler use the Baudot encoding? `strcmp` will be a challenge for such a compiler. – R Sahu Oct 07 '16 at 19:41
  • 5
    I had never even heard of Baudot code before you mentioned this. Unless you are working on a VERY old Western Union system, this probably won't be a problem. But this goes to my point that the standard would allow this, and the assumption that the Latin alphabet is sequenced would get you into trouble here. – ad absurdum Oct 07 '16 at 19:57
  • 1
    @RSahu Why would `strcmp` be a problem? What _guaranteed by the standard_ behavior are you expecting from `strcmp` that would be hard to fulfill on such a system? (As far as I remember, there's no guarantee on the sign of `strcmp("abc", "auc")` besides that it match the sign of `'b' - 'u'`.) – mtraceur Oct 21 '21 at 08:02
  • @mtraceur, I see your point. The challenge won't be challenged but the user will be challenged. We have come to expect, falsely apparently, `strcmp` to perform a lexicographical comparison. – R Sahu Oct 21 '21 at 17:25
6

You could probably just make a table that maps the characters the standard guarantees will exist to their ASCII character numbers. E.g.,

#include <limits.h>
static char mytable[] = {
  ['a'] = 0x61,
  ['b'] = 0x62,
  // ...
  ['A'] = 0x41,
  ['B'] = 0x42,
  // ...
};

The compiler will map each listed character, whatever its value in the current character set (which may be any crazy character set), to its ASCII code, and every entry without an initializer, including the characters that are not guaranteed to exist, will be zero. Then you can use this table for ordering whenever needed.

As you said,

char c1;
char c2;

Could portably be verified to be alphabetically ordered by checking

(c1 < sizeof(mytable) && c2 < sizeof(mytable) ? mytable[c1] < mytable[c2] : 0)

I've actually used this on a research project which runs on ASCII and EBCDIC for predictable ordering, but it's portable enough to work on any character set. Edit: I've left the size of the table empty, so that it is computed as the minimum needed, because of the DeathStation 9000, on which a byte might have 32 bits and hence CHAR_MAX be up to 4294967295 or greater.

paulotorrens
  • 2,286
  • 20
  • 30
  • 1
    Nice idea to use `['a'] = 0x61,` initialization. But I suggest `mytable[CHAR_MAX + 1 - CHAR_MIN]` and `['a' - CHAR_MIN] = 0x61,` instead (and then use it like `mytable[ch - CHAR_MIN]`) to cope with the pesky negative valued `char`. – chux - Reinstate Monica Oct 07 '16 at 20:08
  • If `char` is signed, then `CHAR_MAX + 1 - CHAR_MIN` should result in zero. Not a good idea. – paulotorrens Oct 07 '16 at 20:09
  • 1
    Hmmm 127 + 1 - -128 --> 256 – chux - Reinstate Monica Oct 07 '16 at 20:10
  • Please note that on _some systems_, you might run into a problem by using `CHAR_MAX` (or `CHAR_MAX + 1 - CHAR_MIN`) as the table size. I've seen C compilers for inhospitable systems (e.g., Lisp Machines and the JVM) on which `sizeof(char) = sizeof(short) = sizeof(int) = sizeof(long)`, and thus `CHAR_BIT = 32`, which is perfectly fine by the C standard. You could leave the size empty, and let the compiler calculate it, which will most certainly result in a table with fewer than 256 bytes. I've edited my answer. – paulotorrens Oct 07 '16 at 20:23
  • 2
    You'll get an out-of-bounds array access if you try to compare a non-letter character (specifically one whose value exceeds that of any letters, such as `'~'` in ASCII). You'll probably want to wrap this in a function that checks `isalpha()` before accessing the array. (Be careful, `isalpha()` can return true for locale-specific characters.) – Keith Thompson Oct 07 '16 at 20:41
  • What the heck is this `['a'] = 0x61` syntax? Is this even C? – cat Oct 08 '16 at 00:20
  • 2
    @cat: I think it's called designated initializer, unless that's only for struct members (like `.x = 1, .y = 4`). Yep, just found https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html, and the array version falls under the same name. As a GNU extension, you can even specify ranges to be assigned the same value. – Peter Cordes Oct 08 '16 at 00:21
  • @PeterCordes I [found it](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html), but it's still [bizarre and baffling](https://stackoverflow.com/questions/18731707/why-does-c11-not-support-designated-initializer-list-as-c99) to me – cat Oct 08 '16 at 00:25
  • Yep, it's C, it's called designated initializers, [and it's standard, and portable](http://en.cppreference.com/w/c/language/array_initialization). – paulotorrens Oct 08 '16 at 01:44
4

For A-Z,a-z in a case-insensitive manner (and using compound literals):

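#include <stdlib.h>  // strtol()

char ch = foo();
// In base 36, strtol() gives a-z (either case) the values 10..35
long az_rank = strtol((char []){ch, 0}, NULL, 36);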
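
A sketch of how the base-36 trick could be wrapped so that non-letters are rejected. The function name az_rank_of and the -1 error convention are assumptions made for illustration; the answer itself only shows the bare strtol call.

```c
#include <stdlib.h>

/* Hypothetical wrapper: returns 0..25 for a-z / A-Z (case-insensitive),
   or -1 for anything else. strtol() in base 36 assigns the letters
   a..z the values 10..35 regardless of the execution character set. */
static int az_rank_of(char ch)
{
    char buf[2] = { ch, '\0' };
    char *end;
    long v = strtol(buf, &end, 36);

    if (end == buf)     /* no conversion: sign, space, punctuation, '\0' */
        return -1;
    if (v < 10)         /* a decimal digit, not a letter */
        return -1;
    return (int)(v - 10);
}
```

Comparing az_rank_of(c1) with az_rank_of(c2) then gives a case-insensitive alphabetical order for the 52 basic letters.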

For two chars that are known to be in A-Z or a-z, in either ASCII or EBCDIC:

int compare2alpha(char c1, char c2) {
  int mask = 'A' ^ 'a';  // Only 1 bit is different between upper/lower
  return (c1 | mask) - (c2 | mask);
}

Alternatively, if char is limited to 256 different values, code could use a look-up table that maps each char to its rank. Of course, the table is platform dependent.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • This looks like the trick @user3386109 linked to... very interesting. As for the look-up table, that is sort of what I am asking: for portability do we have to resort to ordering the characters ourselves? – ad absurdum Oct 07 '16 at 19:52
  • @DavidBowling "do we have to resort to ordering the characters ourselves?" A-Z has a defined order and if sticking to A-Z, then assuming all real codings will always use an increasing order is reasonable. Should you want a case sensitive ordering, you no longer have universal agreement if lower < or > upper case. If you want more than A-Z,a-z, there is less universal agreement. Else just convert to a Unicode code-point and compare. – chux - Reinstate Monica Oct 07 '16 at 19:58
  • But I don't see that, e.g, A-Z has a defined order in the standard, so it seems that we are just making that assumption. So we are relying on the fact that no one would map the characters to numbers any other way. That is, it seems that we are relying on convention here. – ad absurdum Oct 07 '16 at 20:09
  • 3
    The 1-bit difference between cases in both ASCII and EBCDIC is an interesting and useful observation - but not really in the C specifications, correct? – Jongware Oct 08 '16 at 00:28
  • @RadLexus The 1 bit difference is specified in ASCII and EBCDIC. It is _reasonable_, though not C specified, to assert all A-Z,a-z coding s _ever_ used by C will share this property. – chux - Reinstate Monica Oct 09 '16 at 03:22
3

With C11, code could use _Static_assert() to ensure, at compile time, that characters have a desired ordering.

An advantage of this approach: the overwhelming majority of character encodings already meet the desired A-Z requirement, so if a novel or esoteric platform uses something different, it may require coding or customization that cannot be foreseen. The best the code can do in that case is to fail to compile.

Example use

#include <ctype.h>  // toupper()

// Sample case insensitive string sort routine that ensures
// 1) 'A' < 'B' < 'C' < ... < 'Z'
// 2) 'a' < 'b' < 'c' < ... < 'z'

int compare_string_case_insensitive(const void *a, const void *b) {
  _Static_assert('A' < 'B', "A-Z order unexpected");
  _Static_assert('B' < 'C', "A-Z order unexpected");
  _Static_assert('C' < 'D', "A-Z order unexpected");
  // Other 21  _Static_assert() omitted for brevity
  _Static_assert('Y' < 'Z', "A-Z order unexpected");


  _Static_assert('a' < 'b', "a-z order unexpected");
  _Static_assert('b' < 'c', "a-z order unexpected");
  _Static_assert('c' < 'd', "a-z order unexpected");
  // Other 21  _Static_assert() omitted for brevity
  _Static_assert('y' < 'z', "a-z order unexpected");

  const char *sa = (const char *)a;
  const char *sb = (const char *)b;
  int cha, chb;
  do {
    cha = toupper((unsigned char) *sa++);
    chb = toupper((unsigned char) *sb++);
  } while (cha && cha == chb);

  return (cha > chb) - (cha < chb);
}
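
Where C11 is not available, a similar check could be made once at run time instead. The following is only a sketch under that assumption; the names strictly_increasing and letters_are_ordered are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every character in s has a greater value than the one
   before it. An empty or one-character string counts as increasing. */
static bool strictly_increasing(const char *s)
{
    for (size_t i = 1; s[0] != '\0' && s[i] != '\0'; i++) {
        if (s[i - 1] >= s[i])
            return false;
    }
    return true;
}

/* Hypothetical run-time counterpart of the static assertions:
   true if 'A'..'Z' and 'a'..'z' each have ascending values. */
static bool letters_are_ordered(void)
{
    return strictly_increasing("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        && strictly_increasing("abcdefghijklmnopqrstuvwxyz");
}
```

A program could call assert(letters_are_ordered()) early in main, which reports the problem at startup rather than at compile time.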
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256