The lower-case and upper-case alphabetic ranges don't cross a %32
"alignment" boundary in the ASCII coding system.
This is why bit 0x20 is the only difference between the upper- and lower-case versions of the same letter.
If this weren't the case, you'd need to add or subtract 0x20, not just toggle it, and for some letters there would be carry-out to flip other higher bits. (And there wouldn't be a single operation that could toggle case, and checking for alphabetic characters in the first place would be harder because you couldn't |= 0x20 to force lcase.)
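As a minimal sketch of what that buys you (the helper names here are made up for illustration, and they are only meaningful when the input is already known to be an ASCII letter):

```c
/* Hypothetical helper names. Each one is a single bitwise operation
 * on bit 0x20, and is only correct for ASCII letters A-Z / a-z. */
static inline char ascii_flip_case(char c)   { return (char)(c ^ 0x20); }  /* toggle case */
static inline char ascii_force_lower(char c) { return (char)(c | 0x20); }  /* set the bit */
static inline char ascii_force_upper(char c) { return (char)(c & ~0x20); } /* clear the bit */
```

Applied to a non-letter these still flip, set, or clear the bit, so '@' (0x40) would turn into '`' (0x60); that's why you need an alphabetic check first.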
Related ASCII-only tricks: you can check for an alphabetic ASCII character by forcing lowercase with c |= 0x20 and then checking if (unsigned)c - 'a' <= ('z'-'a'). So just 3 operations: OR + SUB + CMP against a constant 25. Of course, compilers know how to optimize (c>='a' && c<='z') into asm like this for you, so at most you should do the c|=0x20 part yourself. It's rather inconvenient to do all the necessary casting yourself, especially to work around default integer promotions to signed int.
unsigned char lcase = c|0x20;
if (lcase - 'a' <= (unsigned)('z'-'a')) { // a negative lcase-'a' wraps to a huge unsigned value in the compare
    // c is alphabetic ASCII
}
// else it's not
Or to put it another way:
unsigned char lcase = c|0x20;
unsigned char alphabet_idx = lcase - 'a'; // 0-index position in the alphabet
bool alpha = alphabet_idx <= (unsigned)('z'-'a');
See also Convert a String In C++ To Upper Case (SIMD string toupper for ASCII only, masking the operand for XOR using that check.)
And also How to access a char array and change lower case letters to upper case, and vice versa (C with SIMD intrinsics, and scalar x86 asm case-flip for alphabetic ASCII characters, leaving others unmodified.)
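Putting the check and the toggle together, a scalar case-flip that leaves non-alphabetic bytes alone could look something like this. It's only a sketch: the function name and the (pointer, length) interface are my own choices for illustration, not taken from either linked answer.

```c
#include <stddef.h>

/* Hypothetical helper: flip the case of every ASCII letter in s[0..len),
 * leaving digits, punctuation, and bytes >= 0x80 unmodified. */
void ascii_flip_case_inplace(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        /* Force lower case, then get the 0-based position in the alphabet.
         * Anything that isn't a letter wraps to a value above 25. */
        unsigned char idx = (unsigned char)((c | 0x20) - 'a');
        if (idx <= 'z' - 'a')
            s[i] = (char)(c ^ 0x20);   /* toggle only the case bit */
    }
}
```

Going through unsigned char on both sides sidesteps the integer-promotion problem: the subtraction happens in int, and the cast back to unsigned char does the wrapping.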
These tricks are mostly only useful if hand-optimizing some text-processing with SIMD (e.g. SSE2 or NEON), after checking that none of the chars in a vector have their high bit set. (And thus none of the bytes are part of a multi-byte UTF-8 encoding for a single character, which might have different upper/lower-case inverses). If you find any, you can fall back to scalar for this chunk of 16 bytes, or for the rest of the string.
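Here's a rough SSE2 sketch of that strategy, assuming an x86 / x86-64 target. The function names are hypothetical, and the "fall back" part is left to the caller: the function just reports how far it got before hitting a byte with its high bit set.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */
#include <stddef.h>

/* Upper-case 16 bytes at once, assuming every byte is ASCII. */
static __m128i toupper16_ascii(__m128i v)
{
    /* Shift 'a'..'z' to the bottom of the signed range (-128..-103),
     * because SSE2 only has signed byte compares. */
    __m128i shifted = _mm_sub_epi8(v, _mm_set1_epi8((char)('a' + 128)));
    /* All-ones in every byte that is NOT a lower-case letter. */
    __m128i not_lower = _mm_cmpgt_epi8(shifted, _mm_set1_epi8(-128 + 25));
    /* 0x20 in lower-case-letter bytes, 0 elsewhere. */
    __m128i flip = _mm_andnot_si128(not_lower, _mm_set1_epi8(0x20));
    return _mm_xor_si128(v, flip);
}

/* Hypothetical helper: upper-cases the leading ASCII part of s[0..len)
 * in place and returns the index of the first byte with its high bit
 * set (or len if there is none). The caller handles the rest. */
size_t ascii_toupper_prefix(char *s, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        if (_mm_movemask_epi8(v) != 0)
            break;                      /* some high bit set: not pure ASCII */
        _mm_storeu_si128((__m128i *)(s + i), toupper16_ascii(v));
    }
    /* Scalar loop: finishes the tail, or the chunk that failed the check. */
    for (; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c & 0x80)
            break;
        if ((unsigned)(c - 'a') <= (unsigned)('z' - 'a'))
            c ^= 0x20;
        s[i] = (char)c;
    }
    return i;
}
```

If the return value is less than len, the remaining bytes need a UTF-8-aware routine (or whatever your fallback is).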
There are even some locales where toupper() or tolower() on some characters in the ASCII range produce characters outside that range, notably Turkish where I ↔ ı and İ ↔ i. In those locales, you'd need a more sophisticated check, or you probably shouldn't try to use this optimization at all.
But in some cases, you're allowed to assume ASCII instead of UTF-8, e.g. Unix utilities with LANG=C (the POSIX locale), not en_CA.UTF-8 or whatever.
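One conservative way a program might decide whether it's in that situation is to ask the C library which LC_CTYPE locale is active. This is just a sketch with a made-up function name; it assumes the locale name is reported as "C" or "POSIX", which is what I'd expect on typical Unix systems but isn't something to rely on everywhere.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the current LC_CTYPE locale is the
 * C / POSIX locale, where the ASCII-only tricks match toupper(). */
bool ctype_locale_is_c(void)
{
    /* Passing NULL queries the current locale without changing it. */
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL &&
           (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}
```

Note that a program starts in the "C" locale and only picks up LANG and friends after it calls setlocale(LC_ALL, "") or similar, so this check only tells you something useful after that point.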
But if you can verify it's safe, you can toupper medium-length strings much faster than calling toupper() in a loop (like 5x), and last I tested with Boost 1.58, much much faster than boost::to_upper_copy<char*, std::string>() which does a stupid dynamic_cast for every character.