Why is the alphabet split into multiple ranges in this C code?

Question

In a custom library I saw an implementation:

inline int is_upper_alpha(char chValue)
{
    if (((chValue >= 'A') && (chValue <= 'I')) ||
        ((chValue >= 'J') && (chValue <= 'R')) ||
        ((chValue >= 'S') && (chValue <= 'Z')))
        return 1;
    return 0;
}

Is that an Easter egg, or does it have some advantage over the straightforward C/C++ approach?

inline int is_upper_alpha(char chValue)
{
    return ((chValue >= 'A') && (chValue <= 'Z'));
}

Note that in EBCDIC, the character range for lower-case letters comes before the character range for upper-case letters, and both come before the digits — which is exactly the opposite of the order in ASCII-based encodings (such as the 8859-x series, or Unicode, or CP1252, or …). — Jonathan Leffler, May 06 '15 at 13:44
Note: if `'J' - 'I'` and `'S' - 'R'` both equal `1`, then I expect that a reasonable optimizer would turn the former into the latter. — Matthieu M., May 06 '15 at 14:44

Wintermute · Accepted Answer · 2015-05-05T10:14:58.160

215

The author of this code presumably had to support EBCDIC at some point, where the numeric values of the letters are non-contiguous (gaps exist between I and J, and between R and S, as you may have guessed).

It is worth noting that the C and C++ standards only guarantee that the characters 0 to 9 have contiguous numeric values, for precisely this reason, so neither of these methods is guaranteed by the standard to work on every implementation.
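To make that guarantee concrete, here is a minimal sketch. The names `digit_value` and `is_upper_alpha_portable` are hypothetical and do not come from the library in the question. The digit helper relies only on the contiguity the standards promise, while the letter helper tests membership in an explicit list, so it assumes nothing about how the letters are laid out numerically.

```c
#include <string.h>

/* Relies only on the guarantee that '0'..'9' have consecutive values. */
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Hypothetical helper: a membership test against a string literal, which
   the compiler encodes in whatever execution character set is in use. */
int is_upper_alpha_portable(char c)
{
    /* strchr also matches the terminating '\0', so rule that out first. */
    return c != '\0' && strchr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c) != NULL;
}
```

The linear search is slower than a range comparison, but it should give the same answer under ASCII and EBCDIC.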

edited May 05 '15 at 10:14

answered May 05 '15 at 10:08

Wintermute


1

Yes, the author surely wanted to support EBCDIC code page 037. To check the EBCDIC codes, please refer to http://en.wikipedia.org/wiki/EBCDIC_037 – Mohit Thakur May 05 '15 at 10:17
1

Yes, you are right. The method is implemented for the non-contiguous letters in EBCDIC. Thanks for the answer! – Vladimir Ch. May 05 '15 at 10:28
64

The real WTF is why didn't the original author put in a comment: `// In the EBCDIC coding, the alphabet has gaps between these values. See URL: xxxx for details`. Then you'd never even have to ask the question. You'd have the answer built-in to the code. – abelenky May 05 '15 at 15:12
66

@abelenky If the code was originally written for a system where EBCDIC is normally used, it may have seemed obvious at the time and not in need of a comment. Unfortunately, things that seemed fine in legacy code seem strange now. – Vality May 05 '15 at 15:57
26

@abelenky: The *real* WTF is why didn't the original author use standard functionality, i.e. `return ( isalpha( chValue ) && isupper( chValue ) )`... – DevSolar May 06 '15 at 08:12
Does any machine that uses EBCDIC have a C++ compiler at all? To my knowledge, no single computer built after ~1970 uses this... :-) – Damon May 06 '15 at 08:26
4

@Damon: That is not the issue. You might have to *process* an "alien" encoding even on a system that doesn't use that encoding natively. So you set your locale to the given encoding, and then you have to keep your fingers crossed that the programmer actually used standard functions instead of doing "smart" coding like the above, thinking he knows every encoding his program will ever encounter... – DevSolar May 06 '15 at 09:53
6

If it was written to support EBCDIC in the 1970s, were `isalpha` and `isupper` even part of ANSI C, or supported by the majority of compilers back then? – nickalh May 06 '15 at 11:45
1

@abelenky not really; it's clearly depending upon ranges that happen to exist in the encoding(s) in use. It's certainly no more of a WTF than the second piece of code in the question. – Jon Hanna May 06 '15 at 14:19
@Damon: I believe IBM mainframes do still use EBCDIC, at least in compatibility modes but probably by default. Your cutoff date is at least 30 years premature, and probably more than that. – Jonathan Leffler May 06 '15 at 16:54
4

@DevSolar: Actually `isalpha` is wrong; its results are locale-specific and meant for processing natural language in the user's configured locale, whereas the actual need for most software is to match a fixed set of characters independent of locale. – R.. GitHub STOP HELPING ICE May 07 '15 at 06:44
2

@R.: In my experience, the *actual need* for most software is to match "word contents", or similar, and the programmer simply forgot about locale issues completely... in either case, a comment would do loads of good. ;-) – DevSolar May 07 '15 at 07:09

score 54 · Answer 2 · answered May 05 '15 at 10:08


Looks like it attempts to cover both EBCDIC and ASCII. Your alternative method doesn't work for EBCDIC (it has false positives, but no false negatives).
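As a rough illustration of the false positives, the sketch below applies both checks to raw byte values, using the EBCDIC code page 037 positions of the letters (A-I at 0xC1-0xC9, J-R at 0xD1-0xD9, S-Z at 0xE2-0xE9). The function names are made up for this example, and the code points are written as numbers so the result does not depend on the host's own character set.

```c
/* What ('A' <= c && c <= 'Z') amounts to when compiled for code page 037. */
int in_single_range(unsigned code)
{
    return code >= 0xC1 && code <= 0xE9;
}

/* The three-range check from the question, in code page 037 values. */
int in_three_ranges(unsigned code)
{
    return (code >= 0xC1 && code <= 0xC9)   /* A..I */
        || (code >= 0xD1 && code <= 0xD9)   /* J..R */
        || (code >= 0xE2 && code <= 0xE9);  /* S..Z */
}

/* Byte values the single range accepts although they are not letters. */
int count_false_positives(void)
{
    int n = 0;
    for (unsigned code = 0; code <= 0xFF; ++code)
        if (in_single_range(code) && !in_three_ranges(code))
            ++n;
    return n;
}

/* Letters the single range would reject. */
int count_false_negatives(void)
{
    int n = 0;
    for (unsigned code = 0; code <= 0xFF; ++code)
        if (in_three_ranges(code) && !in_single_range(code))
            ++n;
    return n;
}
```

With these values the single range lets through 15 non-letter bytes and misses no letters. Two of the non-letters are 0xD0 and 0xE0, which code page 037 assigns to `}` and `\`.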

C and C++ do require that '0'-'9' are contiguous.

Note that the standard library calls do know whether they run on ASCII, EBCDIC or other systems, so they're more portable and possibly more efficient.
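For comparison, a version built on the standard library might look like the sketch below. The name `is_upper_alpha_std` is only illustrative. Note that `isupper` consults the current C locale; in the default "C" locale it is true only for the 26 upper-case letters.

```c
#include <ctype.h>

int is_upper_alpha_std(char chValue)
{
    /* The <ctype.h> functions take an int that must be representable as
       unsigned char (or be EOF); the cast avoids undefined behaviour when
       plain char is signed and the value is negative. */
    return isupper((unsigned char)chValue) != 0;
}
```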


MSalters


5

`std::isupper` actually queries the currently installed global C locale. – Lingxi May 05 '15 at 10:21
1

Yes, you are right. The method is written to cover both encodings. Thanks for the answer! – Vladimir Ch. May 05 '15 at 10:26
4

@Lingxi: True, but that doesn't mean you can switch the locale from ASCII to EBCDIC. `'A'` has to remain `'A'` regardless of locale. ASCII to UTF-8, that would be possible. – MSalters May 05 '15 at 10:29
2

@Lingxi: `std::isupper` queries the currently installed global C locale, yes, but the phase of compilation that interprets character literals does not. – Lightness Races in Orbit May 05 '15 at 10:51
1

@Lingxi - Just a quick note. It is questionable whether `std::isupper` is really needed in most cases. It respects the locale used for user input, but when parsing files or interacting with databases you usually expect some other locale. Moreover, at least on Linux, these locale-related calls are very slow: for example, `std::isalpha` calls `dynamic_cast` twice to "find" the proper locale implementation before actually comparing a single character. – ibre5041 May 06 '15 at 07:02
