Detect if execution character set letters are contiguous

Question

In the code, a switch is used to convert letters to contiguous values. Optimizing compilers in general won't do the job as well as the simple contiguous digit condition right before the switch. How can I detect which execution character set used and/or conclude that the letters are contiguous to replace it with simple conditionals ?

static long digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return (c-'0');

    switch(c) {
    case 'a':
    case 'A':
        return 10;
    case 'b':
    case 'B':
        return 11;
    case 'c':
    case 'C':
        return 12;
    case 'd':
    case 'D':
        return 13;
    case 'e':
    case 'E':
        return 14;
    case 'f':
    case 'F':
        return 15;
    case 'g':
    case 'G':
        return 16;
    case 'h':
    case 'H':
        return 17;
    case 'i':
    case 'I':
        return 18;
    case 'j':
    case 'J':
        return 19;
    case 'k':
    case 'K':
        return 20;
    case 'l':
    case 'L':
        return 21;
    case 'm':
    case 'M':
        return 22;
    case 'n':
    case 'N':
        return 23;
    case 'o':
    case 'O':
        return 24;
    case 'p':
    case 'P':
        return 25;
    case 'q':
    case 'Q':
        return 26;
    case 'r':
    case 'R':
        return 27;
    case 's':
    case 'S':
        return 28;
    case 't':
    case 'T':
        return 29;
    case 'u':
    case 'U':
        return 30;
    case 'v':
    case 'V':
        return 31;
    case 'w':
    case 'W':
        return 32;
    case 'x':
    case 'X':
        return 33;
    case 'y':
    case 'Y':
        return 34;
    case 'z':
    case 'Z':
        return 35;
    default:
        break;
    }

    return -1;
}

Related/context: https://stackoverflow.com/questions/13141776/is-the-character-set-of-a-char-literal-guaranteed-to-be-ascii — mkrieger1, Jul 20 '20 at 21:33
Even though the language doesn't require it, unless you need to be portable to an EBCDIC environment you can count on letters being sequential. — Barmar, Jul 20 '20 at 21:43
Would it be enough to just exclude EBCDIC? (that for me is some sort of legend, as I never worked with such charset). — Roberto Caboni, Jul 20 '20 at 21:51
Character sets may have gaps, but AFAIR they need to be sorted. Wouldn't a check `'z'-'a'==25` do the trick for detecting gaps? — Gerhardh, Jul 21 '20 at 06:37
I've seen modern GCC and Clang turn switches with lots of contiguous entires into array lookups, so frankly if your compiler can't do this, my first relfex is "yes it can, you just didn't turn up the optimization settings high enough", and my second reflex is "ok, maybe it can't, but realistically, almost all situations are either ones where you should just not optimize except for clearly expressing your intent to keep the bar as low as possible for compiler optimizations to catch up, or you can just assume a character set in your niche performance-critical use-case." — mtraceur, Oct 21 '21 at 07:55
Having said all that, I think the question is valid and I strongly relate to the motivation behind it. — mtraceur, Oct 21 '21 at 07:55

chux - Reinstate Monica · Accepted Answer · 2020-07-20T22:34:43.660

How can I detect which execution character set used and/or conclude that the letters are contiguous?

At compile time, simply test them all. ('a-z' left out for simplicity)

static_assert(
  'A' == ('B' - 1) &&
  'B' == ('C' - 1) && 'C' == ('D' - 1) && 'D' == ('E' - 1) && 'E' == ('F' - 1) && 'F' == ('G' - 1) && 'G' == ('H' - 1) && 'H' == ('I' - 1) && 'I' == ('J' - 1) && 'J' == ('K' - 1) && 'K' == ('L' - 1) && 'L' == ('M' - 1) && 'M' == ('N' - 1) && 'N' == ('O' - 1) && 'O' == ('P' - 1) && 'P' == ('Q' - 1) && 'Q' == ('R' - 1) && 'R' == ('S' - 1) && 'S' == ('T' - 1) && 'T' == ('U' - 1) && 'U' == ('V' - 1) && 'V' == ('W' - 1) && 'W' == ('X' - 1) && 'X' == ('Y' - 1) &&
  'Y' == ('Z' - 1), "Dinosaur: not continuous A-Z");

static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
  return -1;
}

Other dinosaur tests.

Or use the slow, but highly portable:

static int digit_value(char c) {
  static const char *base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  const char *p = strchr(base36, (unsigned char) c);
  if (p && *p) {
    return (int) (p - base36);
  }
  return -1;
}

Or perhaps a big #if?

#if ('A' == ('B' - 1) && 'B' == ('C' - 1) && 'C' == ('D' - 1) && 'D' == ('E' - 1) && 'E' == ('F' - 1) && 'F' == ('G' - 1) && 'G' == ('H' - 1) && 'H' == ('I' - 1) && 'I' == ('J' - 1) && 'J' == ('K' - 1) && 'K' == ('L' - 1) && 'L' == ('M' - 1) && 'M' == ('N' - 1) && 'N' == ('O' - 1) && 'O' == ('P' - 1) && 'P' == ('Q' - 1) && 'Q' == ('R' - 1) && 'R' == ('S' - 1) && 'S' == ('T' - 1) && 'T' == ('U' - 1) && 'U' == ('V' - 1) && 'V' == ('W' - 1) && 'W' == ('X' - 1) && 'X' == ('Y' - 1) && 'Y' == ('Z' - 1))

static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
  return -1;
}

#else

static int digit_value(char c) {
  static const char *base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  const char *p = strchr(base36, (unsigned char) c);
  if (p && *p) {
    return (int) (p - base36);
  }
  return -1;
}

#endif

Or .... if UCHAR_MAX not too big and concern about speed, make a lookup table and skip the sequential concerns.

#include <limits.h>
int digit_value(char c) {
  unsigned char val[UCHAR_MAX] = {['0'] = 1, ['1'] = 2, ['2'] = 3, ['3'] = 4,
      ['4'] = 5, ['5'] = 6, ['6'] = 7, ['7'] = 8, ['9'] = 10, 
      ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ...
      ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ...
  };
  return val[(unsigned char) c] - 1;
}

The code must allow any kind of character set, but static_assert prohibits non-contiguous ones. — hdante, Jul 20 '20 at 22:03
@hdante Simply use `if ('A' == ('B' - 1) && 'B' == ('C' - 1) ... 'Y' == ('Z' - 1))` and 1st `digit_value()`, else use the 2nd `digit_value(char c)` with its `strchr()`. The compile will optimize. — chux - Reinstate Monica, Jul 20 '20 at 22:09
@chux-ReinstateMonica nice, I don't know if I'm happy or sad, but I'll try it — hdante, Jul 20 '20 at 23:07
@EricPostpischil True `'B' == 'A'+1` may be clearer. Pedantic concern: given A-Z cannot be a _null character_, `'A' == 'B'-1` cannot overflow (execution character set letters are positive) , yet _in theory_, `some_letter+1` can `intmax_t` overflow. And we are making code to detect _strange_ implementations. — chux - Reinstate Monica, Jul 21 '20 at 01:16
My only concern, as I wrote in my answer, is that `static_assert` assert is a "new" feature of the standard, and it will unlikely supported in an old system. — Roberto Caboni, Jul 21 '20 at 06:38
@RobertoCaboni For C99, alternatives readily exist: [Is there a static_assert replacement which satisfies the C99 standard?](https://stackoverflow.com/q/20425938/2410359) Common to emulate with a failure to for an array of size -1. — chux - Reinstate Monica, Jul 21 '20 at 09:24
@chux, it was just a side concern. Anyway the ` #if` chain would do the trick as well. — Roberto Caboni, Jul 21 '20 at 09:26

John Bollinger · Answer 2 · 2020-07-20T22:34:36.533

You can write an appropriate test for conditional compilation, as @chux answered. However, if you need to support character sets with non-contiguous letters, then you need an implementation that supports that, and a table lookup would be a lot more efficient than what you present in the question. So much more so that you could consider using it for all cases instead of maintaining two separate implementations. For example:

static long digit_value(char c) {
    static const long dvs[UCHAR_MAX + 1] = {
      [0] = 1, [1] = 2, [2] = 3, [3] = 4, [5] = 6, [7] = 8, [8] = 9, [9] = 10,
      ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ['F'] = 16, ['G'] = 17,
      ['H'] = 18, ['I'] = 19, ['J'] = 20, ['K'] = 21, ['L'] = 22, ['M'] = 23, ['N'] = 24,
      ['O'] = 25, ['P'] = 26, ['Q'] = 27, ['R'] = 28, ['S'] = 29, ['T'] = 30, ['U'] = 31,
      ['V'] = 32, ['W'] = 33, ['X'] = 34, ['Y'] = 35, ['Z'] = 36,
      ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ['f'] = 16, ['g'] = 17,
      ['h'] = 18, ['i'] = 19, ['j'] = 20, ['k'] = 21, ['l'] = 22, ['m'] = 23, ['n'] = 24,
      ['o'] = 25, ['p'] = 26, ['q'] = 27, ['r'] = 28, ['s'] = 29, ['t'] = 30, ['u'] = 31,
      ['v'] = 32, ['w'] = 33, ['x'] = 34, ['y'] = 35, ['z'] = 36
    };

    return dvs[(unsigned char) c] - 1;
}

Note that the characters in the basic execution character set, which include all the decimal digits and upper- and lower-case Latin letters, are guaranteed to have non-negative integer values. That somewhat simplifies the initializer for the table. Note also that elements that are not explicitly initialized will be default-initialized to zero, which eventually gets converted to a -1 return value by subtracting 1 from the tabulated value.

Perhaps, @chux. I simply matched the table element type with the return type (which in turn I took from the OP's code), knowing that the table will then take up more space than it needs to do. I'm uncertain about the performance implications of the choice of table element type in the OP's target environments, but it is possible that changing it to `char` would provide both a size advantage and a speed advantage. — John Bollinger, Jul 21 '20 at 00:15

Roberto Caboni · Answer 3 · 2020-07-21T09:57:30.003

First of all I have to say that I would choose the preprocessor solution, as the charset of the compiler would be an information I'd like to discover at compile time.

The _static_assert would be an elegant solution, but since it has been introduced only with C11 standard, real dinosaurs are unlikely to support it.

I would perform the check with common preprocessor directives that is with a chain of #ifs each making sure that the representation of one char is contiguous to the one of the previous char.

A simple runtime check

The compiling time check has been well covered by the other answers, so I just want to suggest a simple runtime way for excluding EBCDIC charset: the assumption is a scenario in which the charset can either be EBCDIC or ASCII.

In this way, by excluding EBCDIC we can assume that the ASCII charset is used, so the characters are contiguous.

Just two simple EBCDIC features have to be considered:

The letters that have a non-contiguous decimal representations are well known (for example 'j' != 'i'+1)
The alphabetical chars should have common decimal representations in all hundreds EBCDIC variants, as recommended in this IBM page: invariant charset

So:

static long digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return (c-'0');

    if ( 'j' == 'i'+1 ) //if it's not EBCDIC 
    {
        if( c >= 'a' && c <= 'z' )
        {
             return 10 + c - 'a';
        }
        else if ( c >= 'A' && c <= 'Z' )
        {
            return 10 + c - 'A';
        }
    }

    return -1;
}

An error code is returned in case of EBCDIC charset, and in case of ASCII charset a simple conditional make sure that the range [10 - 35] is returned for both lower case and upper case characters.

0___________ · Answer 4 · 2020-07-20T23:31:22.787

To be portable and most efficient use the lookup table

const char table[128] = {
                         ['0'] = 0,  ['1'] = 1, /* ... */ ['9'] = 9, 
                         ['a'] = 10, ['A'] = 10,
                         ['b'] = 11, ['B'] = 11,
                         ['c'] = 12, ['C'] = 12,
                         ['d'] = 13, ['D'] = 13,
                         ['e'] = 14, ['E'] = 14,
                         ['f'] = 15, ['f'] = 15,
                         ['g'] = 16, ['g'] = 16,
                         ['h'] = 17, ['H'] = 17,
                         ['i'] = 18, ['I'] = 18,
                         ['j'] = 19, ['J'] = 19,
                         /* etc etc */
};

int get_value(char ch)
{
    return table[ch & 0x7f];
}

and the generated code:

get_value:
        movsx   rdi, dil
        movsx   eax, BYTE PTR table[rdi]
        ret

Typos in `['i'] = 18, ['J'] = 18,` and others. Suggest `table[256]` or the like and negative index protection in `return table[ch];` — chux - Reinstate Monica, Jul 20 '20 at 22:39

Detect if execution character set letters are contiguous

4 Answers4

A simple runtime check