28

I'm looking at the following code to test for a hexadecimal digit and convert it to an integer. The code is kind of clever in that it takes advantage of difference between between capital and lower letters is 32, and that's bit 5. So the code performs one extra OR, but saves one JMP and two CMPs.

static const int BIT_FIVE = (1 << 5);
static const char str[] = "0123456789ABCDEFabcdef";

for (unsigned int i = 0; i < COUNTOF(str); i++)
{
    int digit, ch = str[i];

    if (ch >= '0' && ch <= '9')
        digit = ch - '0';
    else if ((ch |= BIT_FIVE) >= 'a' && ch <= 'f')
        digit = ch - 'a' + 10;
    ...
}

Do C and C++ guarantee the ASCII or values of [a-f] and [A-F] characters? Here, guarantee means the upper and lower character sets will always differ by a constant value that can be represented by a bit (for the trick above). If not, what does the standard say about them?

(Sorry for the C and C++ tag. I'm interested in both language's position on the subject).

jww
  • 97,681
  • 90
  • 411
  • 885
  • Make that array static! – Neil Kirk Apr 01 '15 at 00:50
  • 10
    Consider EBCDIC. – chris Apr 01 '15 at 00:53
  • 3
    What does "guarantee" mean in "guarantee the ASCII or values of [a-f] and [A-F]". Guarantee what? Existence? – c-smile Apr 01 '15 at 00:55
  • 3
    @c-smile That they are contiguous and the difference between caps is 32. – Neil Kirk Apr 01 '15 at 00:56
  • 1
    C at least only guarantees that the character set values of a-f and A-F are sequential. – teppic Apr 01 '15 at 00:57
  • 4
    @teppic: Are you sure? I'm looking at n1516 section 5.2.1 right now and it doesn't say that, it only guarantees that 0-9 are sequential. Indeed A-F and a-f are sequential in both EBCDIC and ASCII, however. – Dietrich Epp Apr 01 '15 at 00:58
  • @DietrichEpp I thought it was all of the standard characters but that may well be wrong, I haven't any reference that says that to hand. (Since it doesn't mention the alphabet characters in that section it's pretty safe to assume it is only numbers). – teppic Apr 01 '15 at 01:01
  • 1
    @teppic: No, there is no such guarantee. – Keith Thompson Apr 01 '15 at 01:06
  • @DietrichEpp: [N1570](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) is a more current draft of the C standard. – Keith Thompson Apr 01 '15 at 01:07
  • Note: I once had a [very tight code](http://stackoverflow.com/questions/17070973/how-to-reduce-the-code-space-for-a-hexadecimal-ascii-chars-conversion-using-a-s) space to code `char --> int`. A small mod make it case insensitive. – chux - Reinstate Monica Apr 01 '15 at 01:16
  • I think `if ((ch |= ('a' ^ 'A') >= ('a' | ('a' ^ 'A')) && ch <= ('f' | ('a' ^ 'A')))` works for both ASCII and EBCDIC. – chux - Reinstate Monica Apr 01 '15 at 01:34
  • 2
    AFAICT there's not even a guarantee that `a` < `b`. But that gets us in truly evil territory. – MSalters Apr 01 '15 at 08:03
  • @KeithThompson: The relevant section is largely the same in both. – Dietrich Epp Apr 01 '15 at 08:29
  • 1
    Related: http://stackoverflow.com/questions/9416926/are-the-character-digits-0-9-required-to-have-contiguous-numeric-values – Sebastian Mach Apr 01 '15 at 09:42
  • 1
    @MSalters There *is* in fact a guarantee that `'a' < 'b' < 'c' < ... < 'z'` and `'A' < 'B' < 'C' < ... < 'Z'`. However, the relative ordering of `'a'` and `'A'` is unspecified (and different between ASCII and EBCDIC). – zwol Apr 01 '15 at 14:14
  • 1
    @zwol: Which part of the standard gives us the guarantee? The listing of the members of the basic character set is not a prescriptive ordering. – MSalters Apr 01 '15 at 14:19
  • @MSalters I am 100% certain that this was an explicitly stated requirement in C89 *somewhere*. However, I am failing to find it in N1570 right now, and that's the only version I have on this computer. `¯\_(ツ)_/¯` – zwol Apr 01 '15 at 14:25
  • @zwol: There is in fact no such guarantee in any edition of the C standard. A conforming implementation can have `'a' > 'z'`. (I just checked my copy of the C90 standard, section 5.2.1.) – Keith Thompson Apr 01 '15 at 18:32
  • @KeithThompson Like I say, I am 100% certain that it is in there *somewhere*. It clearly never was in 5.2.1 and I still can't find it, though. Maybe it's just implicit in the behavioral requirements for `strcoll` in the "C" locale? (Which are distressingly underspecified.) – zwol Apr 01 '15 at 19:38
  • @zwol: I'm about 99% certain it isn't. [This](http://www.bsb.me.uk/ansi-c/ansi-c) is a pre-C89 draft of the ANSI C standard; if the ANSI C standard makes that guarantee it should be in there. I see nothing in the description of `strcoll` that implies such a guarantee. If you're right, I eagerly look forward to a citation that demonstrates it (and I'll be glad to admit that I'm wrong). – Keith Thompson Apr 01 '15 at 19:48

3 Answers3

40

No, it does not.

The C standard guarantees that the decimal digits and uppercase and lowercase letters exist, along with a number of other characters. It also guarantees that the decimal digits are contiguous, for example '0' + 9 == '9', and that all members of the basic execution character set have non-negative values. It specifically does not guarantee that the letters are contiguous. (For all the gory details, see the N1570 draft of the C standard, section 5.2.1; the guarantee that basic characters are non-negative is in 6.2.5p3, in the discussion of type char.)

The assumption that 'a' .. 'f' and 'A' .. 'F' have contiguous codes is almost certainly a reasonable one. In ASCII and all ASCII-based character sets, the 26 lowercase letters are contiguous, as are the 26 uppercase letters. Even in EBCDIC, the only significant rival to ASCII, the alphabet as a whole is not contiguous, but the letters 'a' ..'f' and 'A' .. 'F' are (EBCDIC has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J', and between 'R' and 'S').

However, the assumption that setting bit 5 of the representation will convert uppercase letters to lowercase is not valid for EBCDIC. In ASCII, the codes for the lowercase and uppercase letters differ by 32; in EBCDIC they differ by 64.

This kind of bit-twiddling to save an instruction or two might be reasonable in code that's part of the standard library or that's known to be performance-critical. The implicit assumption of an ASCII-based character set should IMHO at least be made explicit by a comment. A 256-element static lookup table would probably be even faster at the expense of a tiny amount of extra storage.

Keith Thompson
  • 254,901
  • 44
  • 429
  • 631
  • 6
    If a 100% guarantee is required, an `#if` could be used to verify those letters are contiguous and trigger an error if they are not. Better to fail at compile-time than run-time. – Boann Apr 01 '15 at 01:03
  • +1 mention that this code doesn't solve any actual problem and even a non-optimal version of this code probably isn't the bottleneck of the actual problem OP is solving... A profiler should be used to avoid targeting the wrong optimisations, since targeting the wrong optimisations can be harmful. – autistic Apr 01 '15 at 01:20
  • Also, a link to the standard would make this that much more credible. http://www.iso-9899.info/n1570.html#5.2.1 – autistic Apr 01 '15 at 01:23
  • 7
    @Boann: (Correction to my previous comment) No, that's not guaranteed to work. The standard specifically says that `#if 'z' - 'a' == 25` does not necessarily evaluate the same way as `if ('z' - 'a' == 25)`; it can use different character sets at compile time vs. at run time. [N1570](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) 6.10.1. – Keith Thompson Apr 01 '15 at 01:56
  • 1
    @KeithThompson: That's just evil. Would using it in a `static_assert` or other compile-time constant expression work okay? – Cornstalks Apr 01 '15 at 02:19
  • 2
    One would have to check all characters. In theory, there could be a frenzy character set where a..z are not contiguous, but where z has accidentally the same distance to a. – Sebastian Mach Apr 01 '15 at 09:44
  • @Keith, I liked your answer a bit better than the others. Everyone who answered told me *NO*, and that was the thrust of it. So I selected the first answer. My apologies to you. – jww Apr 01 '15 at 10:12
  • @Boann - better to not fail at all. Write portable code. – Pete Becker Apr 01 '15 at 11:54
  • @phresnel: Wouldn't be that crazy. Dutch has something like that (it replaces y `U+0079` with ij `U+0133`). – MSalters Apr 01 '15 at 14:24
  • 1
    @jww you do realise that you can change your accepted answer? – Hong Ooi Apr 01 '15 at 14:56
  • @MSalters: I wonder whether that means that such a character set isn't a crazy thing, or that Dutch people are crazy ;-P – ninjalj Apr 01 '15 at 18:13
  • 1
    @HongOoi: Yes, but then I wouldn't have gotten the Populist badge! 8-)} – Keith Thompson Apr 01 '15 at 22:37
  • I love the dutch. I would never admit, but their beer is nice :D Living just 2km of the border :) – Sebastian Mach Apr 04 '15 at 16:44
26

There are no guarantees about the particular values but you shouldn't care, because your software will probably never encounter a system which is not compatible in this way with ASCII. Assume that space is always 32 and that A is always 65, this works fine in the modern world.

The C standard only guarantees that letters A-Z and a-z exist and that they fit within a single byte.

It does guarantee that 0-9 are sequential.

In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

Justification

There are a lot of character encodings out in the world. If you care about portability, you can either make your program portable to different character sets, or you can choose one character set to use everywhere (e.g. Unicode). I'll go ahead and loosely categorize most existing character encodings for you:

  1. Single byte character encodings compatible with ISO/IEC 646. Digits 0-9 and letters A-Z and a-z always occupy the same positions.

  2. Multibyte character encodings (Big5, Shift JIS, ISO 2022-based). In these encodings, your program is probably already broken and you'll need to spend time fixing it if you care. However, parsing numbers will still work as expected.

  3. Unicode encodings. Digits 0-9 and letters A-Z, a-z always occupy the same positions. You can either work with code points or code units freely and you will get the same result, if you are working with code points below 128 (which you are). (Are you working with UTF-7? No, you should only use that for email.

  4. EBCDIC. Digits and letters are assigned different values than their values in ASCII, however, 0-9 and A-F, a-f are still contiguous. Even then, the chance that your code will run on an EBCDIC system is essentially zero.

So the question here is: Do you think that a hypothetical fifth option will be invented in the future, somehow less compatible / more difficult to use than Unicode?

Do you care about EBCDIC?

We could dream up bizarre systems all day... suppose CHAR_BIT is 11, or sizeof(long) = 100, or suppose we use one's complement arithmetic, or malloc() always returns NULL, or suppose the pixels on your monitor are arranged in a hexagonal grid. Suppose your floating-point numbers aren't IEEE 754, suppose all of your data pointers are different sizes. At the end of the day, this does not get us closer to our goals of writing working software on actual modern systems (with the occasional exception).

Dietrich Epp
  • 205,541
  • 37
  • 345
  • 415
  • That's interesting (that only 0-9 are contiguous). That means the test `if (ch >= 'a' && ch <= 'f')` is not valid under obscure conditions. Is that implementation defined or undefined behavior? – jww Apr 01 '15 at 08:34
  • 10
    @jww: "undefined behavior" would mean that `ch >= 'a' && ch <= 'f'` could theoretically crash the program, reformat your hard drive, and spray paint your cat pink. "Implementation defined" means that it is up to the implementation. The implementation has a character encoding, and the choice of character encoding determines the answers to these questions. Note that nobody actually uses encodings where A-F are not contiguous, so nobody cares. – Dietrich Epp Apr 01 '15 at 08:51
  • 12
    Wow, way to teach someone not to think in abstractions. This is an _awful_ answer, no matter how coldly factually accurate it may be. – Lightness Races in Orbit Apr 01 '15 at 10:02
  • 6
    @DietrichEpp: please change *"but you shouldn't care"* to *"In practice, on the most common platforms encountered today,"*. Platform dependency is OK if you are aware of it (and ideally, static_assert on it). Your answer is good and to the point but the first sentence makes my toenails roll up which is not a nice sight at all. – peterchen Apr 01 '15 at 10:43
  • @jww The exact ordering and contents of the execution character set is implementation-defined but must conform to a small number of constraints, one of which is that 'a' through 'z' and 'A' through 'Z' must be in increasing order. However, unlike the digits, they are not necessarily consecutive. Therefore, `if (ch >= 'a' && ch <= 'f')` is guaranteed to be true for 'a', 'b', 'c', 'd', 'e', and 'f', *and may also be true for some other implementation-defined set of characters*. (In both ASCII and EBCDIC, no other characters intrude in that range.) – zwol Apr 01 '15 at 14:23
  • @zwol: yes, 'a' to 'f' are consecutive in EBCDIC, but not 'a' to 'z'. (I know the OP didn't aske that, but still) – ninjalj Apr 01 '15 at 17:52
  • @ninjalj I thought I was clear that "that range" referred to `ch >= 'a' && ch <= 'f'`. – zwol Apr 01 '15 at 18:03
  • 3
    @LightningRacisinObrit: I don't think the goal here is to teach people to think in abstractions when those abstractions are unnecessary. There is a cost to writing portable code, and the benefit to writing code that is portable to character sets which are neither ASCII nor EBCDIC compatible but some *hypothetical third option* is zero, unless someone can demonstrate otherwise. The benefit to EBCDIC portability is usually zero as well. – Dietrich Epp Apr 01 '15 at 19:45
  • @DietrichEpp: But we `isalpha()` for that test. The contiguous nature of `0 -> 9` allows easy converting to integer by a single subtraction. – Martin York Apr 01 '15 at 21:16
  • Not sure I about the "Don't care". There are enough standard functions to test for most characteristics that you want to know about. – Martin York Apr 01 '15 at 21:17
  • 4
    @LightningRacisinObrit: I have a hard time understanding how you could even think that writing portable code has no cost compared to writing non-portable code... usually, you seem pretty savvy. – Dietrich Epp Apr 02 '15 at 05:56
  • 2
    "There is no cost to writing portable code" is just obviously incorrect. Portability is a constraint, and additional constraints increase project costs. You didn't bother to justify the statement, I would like to see what your justification is. Perhaps you were not speaking in general but speaking about some specific case, but then again, the statement wasn't qualified and I didn't see any obvious implicit qualification. – Dietrich Epp Apr 02 '15 at 10:10
  • 1
    Standard answer to anybody suggesting to always support non-ASCII systems: so you’re using digraphs (or even trigraphs) in your source code, consistently, all the time? – Holger Feb 21 '20 at 15:23
24

For maximum portability, clarity and speed, I would suggest a simple switch:

int hex_digit_value(char x)
{
    switch (x)
    {
    case '0': return 0;
    case '1': return 1;
    case '2': return 2;
    case '3': return 3;
    case '4': return 4;
    case '5': return 5;
    case '6': return 6;
    case '7': return 7;
    case '8': return 8;
    case '9': return 9;
    case 'A':
    case 'a': return 10;
    case 'B':
    case 'b': return 11;
    case 'C':
    case 'c': return 12;
    case 'D':
    case 'd': return 13;
    case 'E':
    case 'e': return 14;
    case 'F':
    case 'f': return 15;
    default: return -1;
    }
}

clang -O1 -S transforms this into a simple table lookup:

    addl    $-48, %edi
    cmpl    $54, %edi
    ja  .LBB0_2

    movslq  %edi, %rax
    movl    .Lswitch.table(,%rax,4), %eax
    retq
.LBB0_2:
    movl    $-1, %eax
    retq

For completeness, here is the generated lookup table:

.Lswitch.table:
.long   0                       # 0x0
.long   1                       # 0x1
.long   2                       # 0x2
.long   3                       # 0x3
.long   4                       # 0x4
.long   5                       # 0x5
.long   6                       # 0x6
.long   7                       # 0x7
.long   8                       # 0x8
.long   9                       # 0x9
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   10                      # 0xa
.long   11                      # 0xb
.long   12                      # 0xc
.long   13                      # 0xd
.long   14                      # 0xe
.long   15                      # 0xf
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   4294967295              # 0xffffffff
.long   10                      # 0xa
.long   11                      # 0xb
.long   12                      # 0xc
.long   13                      # 0xd
.long   14                      # 0xe
.long   15                      # 0xf
fredoverflow
  • 256,549
  • 94
  • 388
  • 662