Windows C Runtime toupper slow when locale set

Question

I'm diagnosing an edge case in a cross platform (Windows and Linux) application where toupper is substantially slower on Windows. I'm assuming this is the same for tolower as well.

Originally I tested this with a simple C program on each platform, without setting the locale or even including the locale header, and there was very little performance difference. The test was a million-iteration loop that passed each character of a string to toupper().

After including the header file and adding the line below, it is much slower and calls a lot of the MS C runtime library's locale-specific functions. This is fine, but the performance hit is really bad. On Linux this doesn't appear to have any effect on performance at all.

setlocale(LC_ALL, ""); // system default locale

If I set either of the following, it runs as fast as Linux, but it does appear to skip all the locale functions.

setlocale(LC_ALL, NULL); // should be interpreted as the same as below?
OR
setlocale(LC_ALL, "C");
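One detail worth checking on the two calls above: per the C standard, passing NULL as the locale argument only queries the current locale and changes nothing, so the first call is not equivalent to the second in general. The two look the same in a fresh program only because every C program starts in the "C" locale. A minimal sketch of that query behaviour (the helper name `current_locale_is_c` is made up for illustration):

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the current LC_ALL locale is "C", 0 otherwise.
   setlocale(category, NULL) only queries; it never changes the locale. */
int current_locale_is_c(void)
{
    const char *name = setlocale(LC_ALL, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}
```

Calling this helper before and after `setlocale(LC_ALL, "")` is a quick way to confirm whether the locale was actually switched in a given test run.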

Note: Visual Studio 2015 on Windows 10; G++ on Linux running CentOS.

I have tried Dutch settings with the same outcome: slow on Windows, no speed difference on Linux.

Am I doing something wrong, is there a bug with the locale settings on Windows, or is it the other way round and Linux isn't doing what it should? I haven't debugged the Linux app as I'm not as familiar with Linux, so I do not know exactly what it's doing internally. What should I test next to sort this out?

Code below for testing (Linux):

// C++ is only used for timing.  The original program is in C.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <chrono>
#include <locale.h>

using namespace std::chrono;

void strToUpper(char *strVal);

int main()
{

    typedef high_resolution_clock Clock;
    high_resolution_clock::time_point t1 = Clock::now();

    // set locale
    //setlocale(LC_ALL,"nl_NL");
    setlocale(LC_ALL,"en_US");

    // testing string
    char str[] = "the quick brown fox jumps over the lazy dog";

    for (int i = 0; i < 1000000; i++)
    {
        strToUpper(str);
    }

    high_resolution_clock::time_point t2 = Clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    printf("chrono time %2.6f:\n",time_span.count());
}

void strToUpper(char *strVal)
{
    unsigned char *t;
    t = (unsigned char *)strVal;

    while (*t)
    {
        *t = (unsigned char)toupper(*t);
        t++;    // advance to the next byte
    }
}

For Windows, change the locale information to:

// set locale
//setlocale(LC_ALL,"nld_nld");
setlocale(LC_ALL, "english_us");

You can see the locale change from the separator in the time completed, full stop vs comma.

EDIT - Profiling data: the profile shows that most of the time is spent in child system calls from _toupper_l. Without the locale information set, the toupper call does NOT call the child _toupper_l, which makes it very quick.

Are you testing this against a fully optimized "Release" build? Also, it's possible that GCC is optimizing the loop out completely, since you don't derive any output value from the loop and the string is not `volatile`. — paddy, Apr 18 '16 at 06:03
@paddy Release builds, my original code did output the string in the end and set the variable, I did simplify it for example code. If you increase the iterations it does increase the time spent so I doubt it's optimising out the loop? This still wouldn't explain the performance issue in the production application which does certainly use the output. — Matt B, Apr 18 '16 at 06:25
A way to check if Linux is doing the necessary calls is to set the locale to Turkish. If `toupper('i')` is `'I'`, it's not respecting the locale (it should be `'İ'`, that's a capital I with a dot). Might be interesting to test the performance under Linux with a Turkish locale too. — Martin Bonner supports Monica, Apr 18 '16 at 06:38
@MattB: did you check that this simplified version still has the performance results you talk about? If not, someone's probably going to waste some time while trying to reproduce this. Also, don't put your microbenchmark in `main`. gcc marks it as "cold" and optimizes it less than other functions, because that's a good thing for real programs. — Peter Cordes, Apr 18 '16 at 06:50
IIRC, glibc's `toupper` implementation uses a table lookup, and so does `isdigit`, `isalpha`, etc., regardless of whether the locale is C or not. IDK what Windows uses. BTW, for ASCII, it only takes [about 4 x86 instructions](http://stackoverflow.com/questions/35932273/how-to-access-a-char-array-and-change-lower-case-letters-to-upper-case-and-vice/35936844#35936844) to downcase a character in a register, but that's only useful if LANG=C is common enough to have a conditional branch checking for it. — Peter Cordes, Apr 18 '16 at 06:59
@PeterCordes Yes I did check that the simplified version had the issue before posting. Thanks about the tip for the benchmark, in the future I'll remember that. In this scenario we are talking under a second on Linux vs 6 seconds on Windows so not sure if that would really make much difference to the outcome. — Matt B, Apr 18 '16 at 07:05
@MartinBonner Thanks for the quick test, I just tested this and it works! I had to change my terminal encoding and changed the locale to tr_TR and the i has a dot above it. Without changing the encoding it was a ? in the terminal window. No difference in speed vs en_US. — Matt B, Apr 18 '16 at 07:08
Have you tried any of the alternative implementations from [this SO question](http://stackoverflow.com/questions/735204/convert-a-string-in-c-to-upper-case) for upcasing a whole string? The accepted answer uses `boost::to_upper_copy`. Boost might use its own locale stuff instead of using Windows' implementation, IDK. Worth trying, esp. if you actually have `std::string`s instead of `char *`s. Otherwise not so nice. — Peter Cordes, Apr 18 '16 at 07:22
@PeterCordes Thanks for the suggestion I had seen boost around but the original project I found this issue in is C not C++. My test had chrono for the easy cross platform timing. Currently I'm trying to make sure that my assumptions that it's the MSVCR library that is the issue compared to Linux. Has anyone attempted the test and had the same issue/results as me? If this is the case I'll talk to MS and look at alternatives. — Matt B, Apr 18 '16 at 07:43
@MartinBonner: I posted a vectorized [`strtoupper` on another C++ question](http://stackoverflow.com/a/37151084/224132). A fallback to scalar for non-ASCII, or for ASCII characters that map to non-ASCII characters, is a potential TODO item, but I only wrote this for fun, and don't have the motivation to do that ATM. Of interest is the finding that ` boost::to_upper_copy` is more than 10x slower than a loop calling glibc's `toupper` (and assuming the result is a single-byte character). And more than 100x slower than a manually-vectorized ASCII-only loop. — Peter Cordes, May 12 '16 at 10:35

score 1 · Accepted Answer · edited May 23 '17 at 12:16

Identical (and fairly good) performance with LANG=C vs. LANG=anything else is expected for the glibc implementation used by Linux.

Your Linux results make sense. Your testing method is probably ok. Use a profiler to see how much time your microbenchmark spends inside the Windows functions. If the Windows implementation does turn out to be the problem, maybe there's a Windows function that can convert whole strings, like the C++ boost::to_upper_copy<std::string> (unless that's even slower, see below).
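If a profiler isn't handy, a rough timing harness along the lines suggested in the comments (benchmark outside `main`, with a result that is actually used) might look like the following. This is only a sketch: `time_upper` and `str_to_upper` are hypothetical names, `clock()` gives coarse timing, and as in the original test the string is already upper case after the first pass.

```c
#include <ctype.h>
#include <time.h>

/* Upcase a NUL-terminated string in place, one byte at a time. */
static void str_to_upper(char *s)
{
    for (unsigned char *t = (unsigned char *)s; *t; t++)
        *t = (unsigned char)toupper(*t);
}

/* Hypothetical helper: runs str_to_upper() iters times and returns the
   elapsed time in seconds as reported by clock(). *checksum receives a
   value derived from the string so the work has an observable result. */
double time_upper(char *s, int iters, unsigned *checksum)
{
    unsigned sum = 0;
    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        str_to_upper(s);
        sum += (unsigned char)s[0];
    }
    clock_t end = clock();
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_upper` once before and once after `setlocale(LC_ALL, "")` and printing both the time and the checksum should show whether the slowdown follows the locale change.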

Also note that upcasing ASCII strings can be SIMD vectorized pretty efficiently. I wrote a case-flip function for a single vector in another answer, using C SSE intrinsics; it can be adapted to upcase instead of flipcase. This should be a huge speedup if you spend a lot of time upcasing strings that are more than 16 bytes long, and that you know are ASCII.

Actually, Boost's to_upper_copy() appears to compile to extremely slow code, like 10x slower than toupper. See that link for my vectorized strtoupper(dst,src), which is ASCII-only but could be extended with a fallback when non-ASCII src bytes are detected.
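The "ASCII fast path with a fallback" idea can be sketched in plain scalar C, without any SIMD. This is not the vectorized code referred to above; `upper_ascii_fast` is a made-up name, and the sketch assumes the locale maps 'a'..'z' to 'A'..'Z', which (as noted in the comments) does not hold for Turkish 'i'.

```c
#include <ctype.h>

/* Hypothetical helper: upcase in place, taking a cheap path for ASCII bytes.
   ASSUMPTION: the locale maps 'a'..'z' to 'A'..'Z' (not true for Turkish 'i'). */
void upper_ascii_fast(char *s)
{
    for (unsigned char *t = (unsigned char *)s; *t; t++) {
        unsigned char c = *t;
        if (c < 0x80) {
            /* ASCII: lower-case letters differ from upper case by 0x20 */
            if (c >= 'a' && c <= 'z')
                *t = (unsigned char)(c - 0x20);
        } else {
            /* non-ASCII byte: defer to the locale-aware library call */
            *t = (unsigned char)toupper(c);
        }
    }
}
```

Only bytes with the high bit set reach the library call, so for mostly-ASCII text the locale machinery is rarely touched.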

How does your current code handle UTF-8? There's not much gain in supporting non-ASCII locales if you assume that all characters are a single byte. IIRC, Windows uses UTF-16 for most stuff, which is unfortunate because it turned out that the world wanted more than 2^16 codepoints. UTF-16 is a variable-length encoding of Unicode, like UTF-8 but without the advantage of reading ASCII. Fixed-width has a lot of advantage, but unfortunately you can't assume that even with UTF-16. Java made this mistake, too, and is stuck with UTF-16.

The glibc source is:

#define __ctype_toupper \
     ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
int toupper (int c) {
    return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}

The asm from the x86-64 Ubuntu 15.10's /lib/x86_64-linux-gnu/libc.so.6 is:

## disassembly from  objconv -fyasm -v2 /lib/x86_64-linux-gnu/libc.so.6 /dev/stdout 2>&1
toupper:
    lea     edx, [rdi+80H]                          ; 0002E300 _ 8D. 97, 00000080
    movsxd  rax, edi                                ; 0002E306 _ 48: 63. C7
    cmp     edx, 383                                ; 0002E309 _ 81. FA, 0000017F
    ja      ?_01766                                 ; 0002E30F _ 77, 19
    mov     rdx, qword [rel ?_37923]                ; 0002E311 _ 48: 8B. 15, 00395AA8(rel)
    sub     rax, -128                               ; 0002E318 _ 48: 83. E8, 80
    mov     rdx, qword [fs:rdx]                     ; 0002E31C _ 64 48: 8B. 12
    mov     rdx, qword [rdx]                        ; 0002E320 _ 48: 8B. 12
    mov     rdx, qword [rdx+48H]                    ; 0002E323 _ 48: 8B. 52, 48
    mov     eax, dword [rdx+rax*4]                  ; 0002E327 _ 8B. 04 82   ## the final table lookup, indexing an array of 4B ints
?_01766:
    rep ret                                         ; actual objconv output shows the prefix on a separate line

So it takes an early-out if the arg isn't in the -128 to 0xFF range (so this branch should predict perfectly not-taken); otherwise it finds the table for the current locale, which involves three pointer dereferences: one load from a global, one thread-local load, and one more dereference. Then it actually indexes into the table, which covers 384 values (-128 to 255, hence the +128 bias) with one 4-byte int each.
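The same table-lookup idea can be approximated in portable C by caching the results of `toupper` once per locale. This is a sketch, not glibc's code: `upper_table`, `upper_table_init`, and `cached_toupper` are hypothetical names, it uses a plain 256-entry table rather than glibc's biased one, and whether it actually sidesteps the Windows slowdown is untested here.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical cache: one entry per unsigned char value. */
static unsigned char upper_table[UCHAR_MAX + 1];

/* Fill the table from the current locale.
   Must be called again after every setlocale() that changes LC_CTYPE. */
void upper_table_init(void)
{
    for (int c = 0; c <= UCHAR_MAX; c++)
        upper_table[c] = (unsigned char)toupper(c);
}

/* Table-lookup replacement for toupper();
   out-of-range values (e.g. EOF) pass through unchanged. */
int cached_toupper(int c)
{
    return (c >= 0 && c <= UCHAR_MAX) ? upper_table[c] : c;
}
```

The per-character cost then becomes one range check and one array load, regardless of how expensive the runtime's locale-aware `toupper` is.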

This is the entire library function; the toupper label in the disassembly is what your code calls. (Well, through a layer of indirection through the PLT because of dynamic linking, but after the first call triggers lazy symbol lookup, it's just one extra jmp instruction between your code and those 11 insns in the library.)

1. The upper case version of a lower case ASCII character is not necessarily ASCII (specifically "i" in Turkish). 2. Most European languages can be represented in single byte representations. You "just" have to use the right code-page. — Martin Bonner supports Monica, Apr 18 '16 at 11:03
@MartinBonner: Oh right, I saw your comment on the question. If there are fewer than 16 cases of ASCII->non-ASCII, you can use SSE4.2 `PCMPISTRI` to check for them at the same time as the terminating zero-byte, to implement locale-aware `strtoupper()` with ASCII SIMD and a scalar fallback. e.g. load a vector of special-case input characters from a per-locale array. BTW, what's `tolower('İ')` in the Turkish locale? glibc's `tolower` only does anything for characters in the low 256. — Peter Cordes, Apr 18 '16 at 11:34
@MartinBonner: re: 2: Unicode doesn't use code-pages, though, right? You're just saying that a fixed-width single-byte encoding is possible for most languages, not that Unicode has anything to do with it? — Peter Cordes, Apr 18 '16 at 11:35
2. Yup. Most languages don't need Unicode - but life is much easier if you use it. — Martin Bonner supports Monica, Apr 18 '16 at 19:05
Added profiling information to original question. @PeterCordes Thanks for the code, I don't think it will work as mentioned by MartinBonner with code pages like Turkish. Correct me if I'm wrong but ASCII is only the first 7 bits? Whereas from what I can tell code pages use the whole 8 bits but change the last 128 depending on the current code page. — Matt B, Apr 19 '16 at 02:13
This is what Linux appears to do very quickly and a look at the source linked has locale files including this for Turkish: , for the toupper function but this is Unicode, this doesn't match the ASCII code page for Turkish of Hex 69 to 98. To add more confusion when I print out the value from Linux I'm getting xDD or decimal 221 (-35). I cannot match this value to the value of "İ" at all that is being printed in the terminal from the software. What am I missing? — Matt B, Apr 19 '16 at 02:44
Okay I was confusing different character sets and how Unicode works with code pages. The terminal was set to "Windows-1254" Turkish encoding. On [Windows-1254 Codepage](https://msdn.microsoft.com/en-us/goglobal/cc305147.aspx) "İ" is xDD on the code page and the Unicode equivalent is U+0130. — Matt B, Apr 19 '16 at 03:15
