As Lundin pointed out in the comments, these are premature optimisations. They're premature because you're clearly just guessing, and not using a profiler to test your theories on real-world programs with real-world bottlenecks.
They're also micro-optimisations, and we haven't really needed to micro-optimise in C since the 80s, thanks to significant breakthroughs in technology that allow us to do amazing things like mocking up three-dimensional realms in real time.
`gcc` already has support for various features such as dead code elimination, code hoisting (even into compile time, in some cases) and profile-guided optimisation, which you might be interested in. I think we take for granted the fact that we have a compiler which can statically deduce when code is unreachable, all by its lonesome; that's quite a complex optimisation for a machine to perform.
That's profile-guided optimisation, and it's semi-automated: you profile as you test, then recompile, feeding the profiler's data back into the compiler so that it knows how to arrange branches to be better predicted. I wonder what the authors of Wolfenstein 3D would've done for this kind of technology...
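For `gcc` specifically, the workflow is to compile with `-fprofile-generate`, run the instrumented binary on representative inputs to collect profile data, and then rebuild with `-fprofile-use` so the optimiser can act on what it learned.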
Speaking of profilers, may I suggest that you test these in some realistic use cases (i.e. actual programs that have active development and a large community):

- using `register`
- reduction in strength (`i%32`, `0x10000`)
- `count/8` = `count >> 3`
That last optimisation isn't even correct for you (see exercise 5), by the way... This might be a good time to mention the other debugging tools we have in our suite. I can highly recommend checking out ASan (and the other sanitisers, such as UBSan), for example; they'll likely save you hours of debugging one day.
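To illustrate, here's a deliberately buggy toy (the file name `oops.c` is just a placeholder): built with `gcc -g -fsanitize=address oops.c`, it dies with a stack-buffer-overflow report pointing at the exact line, instead of silently reading past the array:

```c
#include <string.h>

int main(void)
{
    char buf[8] = "1234567";
    /* off-by-one: buf[strlen(buf) + 1] is buf[8], one past the end */
    return buf[strlen(buf) + 1];
}
```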
It might be best to use `size_t`, since that's what `strlen` returns for a start; `size_t` is more portable for use with `strlen`, and quite possibly faster too (due to the fact that `size_t` has no sign bit, and so no potential sign-handling overhead when you write things like `for (size_t x = 0; x < y; x++)`)...

... or, if you want to provide architecture-specific hints to your compiler (which presumably has no profile-guided optimisation, or else you wouldn't need to do that manually), you could also use `uint_fast32_t` or something else that isn't really suitable for the task, but is still vastly more suitable than `int`.
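Here's a minimal sketch of that, assuming a hypothetical `count_spaces` helper; the point is simply that `size_t` matches `strlen`'s return type, so there's no truncation or sign conversion anywhere in the loop:

```c
#include <string.h>

/* hypothetical example: count the spaces in a string, using size_t
   throughout so the index type agrees with what strlen returns */
size_t count_spaces(const char *s)
{
    size_t count = 0;
    for (size_t i = 0, len = strlen(s); i < len; i++)
        count += (s[i] == ' ');
    return count;
}
```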
I gather you must be getting input from somewhere, or else your program is "pure", in the (functional) sense that it has no side effects that change the way the user interacts with the system at all (in that case, your compiler might even hoist all of your logic into compile-time evaluation)... Have you considered adjusting the buffer sizes of whichever files and/or streams (i.e. `stdin` and `stdout`) you're using? You ought to be able to use `setvbuf` to do that... If you have many streams open at once, you might want to try choosing a smaller stream buffer size in order to keep all of your stuff in cache. I like to start off with a buffer size of `1` and work my way up from there; that way you'll see precisely where the bottleneck for your system is... To be clear, if you were to unroll loops (which `gcc` will happily do automatically if it's beneficial)...
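As an illustration, here's a minimal `setvbuf` sketch; `BUFFER_SIZE` is a hypothetical starting point that you'd sweep while profiling (starting from `1`, as above, if you want to find exactly where the bottleneck is):

```c
#include <stdio.h>

#define BUFFER_SIZE 4096 /* hypothetical; sweep this while profiling */

int main(void)
{
    /* static so the buffer outlives every use of the stream */
    static char buf[BUFFER_SIZE];

    /* must be called before any other operation on the stream */
    if (setvbuf(stdout, buf, _IOFBF, sizeof buf) != 0)
        return 1;

    for (int i = 0; i < 100000; i++)
        puts("some output");

    return 0;
}
```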
If you're using a really primitive compiler (which you're not; though profilers are honestly likely to guide you straight to the heftiest optimisations in any case, optimising your time as a developer), you might be able to suggest that the compiler emit non-branching code for these lines:
```c
// consider `count += (toupper((unsigned char) string0[i0]) == toupper((unsigned char) string1[i1]));`?
if (toupper(string0[i0]) == toupper(string1[i1]))
    count++;
```
The casts are necessary to prevent crashes in certain circumstances, by the way... you absolutely need to make sure the only values you pass to a `<ctype.h>` function are `unsigned char` values, or `EOF`.
```c
// consider using `i1 -= !(i0 % 32);`?
if ((i0 % 32) == 0)
    i1--;
```
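(For what it's worth, `!(i0 % 32)` evaluates to `1` exactly when `i0` is a multiple of 32 and to `0` otherwise, so the subtraction happens unconditionally rather than behind a branch... though, as above, a decent compiler will often make this transformation on its own.)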