Does moving out a unchanged computation out of a loop improve a program's performance?

Question

In Computer Systems: a Programmer's Perspective, there is an example for optimizing program performance. Will moving 'A'-'a' out of the loop further improve the program's performance? Why isn't the change included in the faster version? Thanks.

The code from "Figure 5.7 Lowercase conversion routines. The two procedures have radically different performance":

/* Convert string to lowercase: slow */
void lower1(char *s)
{
  long i;

  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

/* Convert string to lowercase: faster */
void lower2(char *s)
{
  long i;
  long len = strlen(s);

  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

/* Sample implementation of library function strlen */
/* Compute length of string */
size_t strlen(const char *s)
{
  long length = 0;
  while (*s != '\0') {
    s++;
    length++;
  }
  return length;
}

Since `'A' - 'a'` is an operation on two compile time constants chances are high the compiler will replace it with a constant value — UnholySheep, Sep 28 '20 at 15:01
`'A' - 'a'` is a compile-time constant, which I'd expect any compiler worth its salt to compute at compile-time, rather than it *ever* happening at execution time. — Jon Skeet, Sep 28 '20 at 15:01
You need to keep in mind what the compiler's optimizer will do automatically. Since `'A' - 'a'` is a constant expression, the compiler will evaluate it at compile time, so no code will be generated for the subtraction. — Tom Karzes, Sep 28 '20 at 15:01
Why do you not benchmark and test it yourself? For your particular system, with the compiler and the optimization you use. It will be different on each system, compiler and some compiler flags. — 12431234123412341234123, Sep 28 '20 at 15:02
Images of code are hard to use and shouldn't be posted. Plase post your code **as text directly in your question**. — MikeCAT, Sep 28 '20 at 15:03
Adding to Tom Karzes: However `strlen` is not a constant expression and the compiler might not notice that the return value will not change between the loop iterations and so moving it out of the loop head reduces the number of calls to `strlen` from `strlen(s)+1` to 1. — Ackdari, Sep 28 '20 at 15:04
I doubt that the 2 functions differ at all, with enough compiler optimization enabled. Smart compilers could detect that `strlen()` is constant and move it out of the loop. But i would need the code to test it, not an image. The book is from 2002, compiler made huge advantages in optimizations since then. — 12431234123412341234123, Sep 28 '20 at 15:07
@JonSkeet (1) How do you tell if a non-value expression is a compile time constant or not? (2) Will manually moving it out of the loop by a temporary variable introduce overhead? — Tim, Sep 28 '20 at 15:08
Please indent your code before posting. Make it easier for your peers to read and analyze it! — Notinlist, Sep 28 '20 at 15:12
`strlen()` returns `size_t` not `long`, `long len = strlen(s); ` can cause UB. — 12431234123412341234123, Sep 28 '20 at 15:28
The performance difference between `lower1` and `lower2` comes from the `strlen`. You might find [this reading](https://en.wikichip.org/wiki/schlemiel_the_painter%27s_algorithm) interesting. — Jabberwocky, Sep 28 '20 at 15:33
I tested it, i was wrong, at least on my debian AMD64 with GCC 8.3.0 and `-O3`, moving `strlen()` out of the loop makes it faster. You can make it even faster by not using `strlen()` and check for `'\0'` in the same loop. This was the fastest i came up with: `while(1) { if(!*s) {break;} if(*s >= 'A' && *s <= 'Z') {*s |= 0x20;} s++; }` — 12431234123412341234123, Sep 28 '20 at 16:07

score 3 · Answer 1 · answered Sep 28 '20 at 15:41

3

To directly answer your question: Yes, moving functions out of loops does improve performance. The only question is if the function depends on some value in the loop. If it does, then it needs to be in the loop.

In the example code in your question the slow function is strlen, not the character subtraction. Every time that the for loop checks i < strlen(s) it has to run the strlen function.

It depends some on how smart the compiler is. Because strlen is part of the C standard, compilers are free to assume how it works. So they know it only changes if the string changes, so in many cases they can cache the value outside of the loop. But in the example code, the string is being changed so that optimization probably won't happen.

And let me add to and reiterate one point there. The compiler assumes that functions with the same name as in the C standard work exactly the way they are defined in the standard.

So if you write your own version of strlen, malloc or memcpy it had better exactly match the specification. Because the compilers can assume how they work and even replace the function call with equivalent code.

answered Sep 28 '20 at 15:41

Zan Lynx

53,022
10
79
131

3

The string is being changed, but a *really* smart compiler could recognize that the code will never set a non-zero character to 0, or change a 0 character to non-zero, and therefore will not change the value returned by strlen() (all that assuming ASCII, just like that sad example does) – Sep 28 '20 at 18:36
@user431397: Indeed. Unfortunately, the computational cost (and thus compile time) to look for optimizations like that is high. And often defeated anyway by loops that pass the string to a non-inline function which could make unknown changes to it. So it's generally not worth it for compiler devs to write code to look for such an optimization: not provably possible in a lot of source code, expensive all the time to look for, and easy / obvious for humans to solve if they give any thought to performance. – Peter Cordes Sep 28 '20 at 21:15
*So if you write your own version of strlen, malloc or memcpy it had better exactly match the specification* - GCC does have a `-fno-builtin` option, or `-fno-builtin-strlen` to not internally define `strlen` as `__builtin_strlen`. But yes, unless you're writing an OS kernel or a C library, it's better to just use different names if you want to define your own functions that are similar but different from ISO C standard functions. – Peter Cordes Sep 28 '20 at 21:17
@PeterCordes: I believe the optimizer still makes assumptions about functions. Like with memcpy, that it can replace an assignment with a call to memcpy, or that after a memcpy both objects will have the same value and can be used interchangably. That sort of thing. – Zan Lynx Sep 30 '20 at 14:56
The mechanism for making those assumptions is to pre-define `memcpy` as a "builtin" function that GCC knows what it does. It might still emit a call to the asm symbol name `memcpy`, but other times (like small fixed sizes) it can just inline a load and/or store instruction. If you disable that with `-fno-builtin-memcpy`, it will treat `memcpy()` just like any other unknown function that it can't see the definition of (and thus can't inline or make assumptions about). Example: https://godbolt.org/z/d7je83 – Peter Cordes Sep 30 '20 at 22:19

12431234123412341234123 · Answer 2 · 2020-09-28T21:16:23.923

TL;DR: I tested it on 2 systems and no, it was not faster on both systems. But you can make the code faster with removing the strlen().

I tested both functions and a lot more. I tested what happens when i replace -('A'-'a') with +0x20, replaced + by | and replacing the for loop with a while loop. Result is that this does not change performance.

However, moving strlen() to the beginning and only call it once improved performance. A way to improve performance even more was to remove the call to strlen() and check for the '\0'-byte inside the same loop. This way we have to go trough the loop only once, this probably reduces cache misses on longer strings.

Testprogram

I tested it by creating an array of random strings, copy the array and lowering all copies with one method while measuring the lowering time. Then i did the same, with the same random array of strings, with all the other methods. And i repeated this multiple times.

The code should work on POSIX compatible systems, but you probably have to replace the GetTime() function for other systems such as Windows. I compiled it with GCC and the -O3 flag.

#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <errno.h>
#include <stdlib.h>
#include <inttypes.h>

#include <time.h>


//#define DEBUG


#ifdef DEBUG
  #define N 10
#else
  #define N 1000UL*100
#endif

#define M 20

#define STR_(x) #x
#define STR(x) STR_(x)



void lower1(char *s)
{
  size_t i;

  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= ('A' - 'a');
    }
  }
}

void lower2(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= ('A' - 'a');
    }
  }
}

void lower3(char *s)
{
  size_t i;
  size_t len = strlen(s);
  int d='A'-'a';

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] -= d;
    }
  }
}

void lower4(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] += 0x20;
    }
  }
}

void lower5(char *s)
{
  size_t i;

  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] += ('a' - 'A');
    }
  }
}

void lower6(char *s)
{
  size_t i;
  for (i = 0; i < strlen(s); i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] |= 0x20;
    }
  }
}

void lower7(char *s)
{
  size_t i;
  size_t len = strlen(s);

  for (i = 0; i < len; i++)
  {
    if (s[i] >= 'A' && s[i] <= 'Z')
    {
      s[i] |= 0x20;
    }
  }
}

void lower8(char *s)
{
  size_t len = strlen(s);
  while(len--)
    {
      if (*s >= 'A' && *s <= 'Z')
      {
        *s |= 0x20;
      }
      s++;
    }
}

void lower9(char *s)
{
  while(1)
  {
    if (!*s)
    {
      break;
    }
    if (*s >= 'A' && *s <= 'Z')
    {
      *s |= 0x20;
    }
    s++;
  }
}

void lowerA(char *s)
{
  while(*s)
  {
    if (*s >= 'A' && *s <= 'Z')
    {
      *s |= 0x20;
    }
    s++;
  }
}

uint64_t die(const char *msg)
  {
    fprintf(stderr,"die: %s : %s\n",msg,strerror(errno));
    exit(1);
  }

uint64_t getTime(void)
  {
    uint64_t time;
    struct timespec  t_v;
    if(clock_gettime(CLOCK_BOOTTIME,&t_v)<0)
      {
        die("cant get time");
      }
    time=t_v.tv_sec*1000000000ULL;
    time+=t_v.tv_nsec;
    return time;
  }
  

void test(void (*fp)(char *),char (*s)[M],const char *name)
  {
    static char (*copy)[M];
    copy=malloc(N*M);
    if(!copy)
      {
        die("can't alloc memory");
      }
    memcpy(copy,s,N*M);
    uint64_t start=getTime();
    for(size_t u=0;u<N;u++)
      {
        fp(copy[u]);
      }
    uint64_t end=getTime();
    printf("time %13"PRIu64" %s\n",end-start,name);
    #ifdef DEBUG
      for(size_t u=0;u<N;u++)
        {
          printf("%3zu %"STR(M)"s %"STR(M)"s\n",u,s[u],copy[u]);
        }
    #endif
    free(copy);
  }
  
void runTest(void)
{
  //create a random string
  srand(getTime());
  static char string[N][M];
  for(size_t u=0;u<N;u++)
  {
    size_t l=rand()%M;
    for(size_t i=0;i<l;i++)
    {
      string[u][i]=rand()%('z'-'/')+'/';
    }
    string[u][l]='\0';
  }
  #define TEST(s) test(s,string,STR(s))
  TEST(lower1);
  TEST(lower2);
  TEST(lower3);
  TEST(lower4);
  TEST(lower5);
  TEST(lower6);
  TEST(lower7);
  TEST(lower8);
  TEST(lower9);
  TEST(lowerA);
}

int main(void)
{
  for(unsigned i=0;i<8;i++)
  {
    runTest();
  }
  return 1;
}

The disassembly on AMD64 shows that functions lower1(), lower5() and lower6() (functions that call strlen() in every loop, compiler did not optimize that call) are almost identical with the exception of addresses and that a add instructions was replaced by the or instruction. lower2(), lower3(), lower4() and lower7() (functions where strlen() is only called once and for is used) are also almost identical. lower8() is different from each other (uses strlen() once and a while-loop). loop9() and loopA() are almost identical and do not call strlen())

Results

On my Debian 9 Stretch ARM running on a Raspberry Pi, the functions lower9() and lowerA() are equally as fast and faster than all other tested functions. lower2(), lower3(), lower4(), lower7() and lower8() took about 58-66% more time but are equally to each other. Dispate the different assembly for lower8() the execution time did not differ significantly. lower1() and lower6() took about 297-348% longer than lower9() and lowerA(), interestingly lower5() took even longer (consistent in multiple measurements) with 324%-375%. I do not know why lower5() took longer since it uses the same machine code except for different addresses (this is also true for the ARM code).

On my Debian 10 Buster AMD64, the function lowerA() is the fastest, faster than lower9() by about 3%-6%. I don't know why. But lower5() is here as fast as lower1() and lower6().

[Identical functions have different performance, why?](https://stackoverflow.com/q/64108588) - `lowerA` and `lower9` compile to identical machine code, as expected from the same logic. Differences are probably due to minor effects of code alignment when calling one after the other or vice versa, or measurement artifacts. — Peter Cordes, Sep 28 '20 at 21:10
If you actually want this to run fast for ASCII, you can do 16 bytes at a time (after getting to an alignment boundary to avoid possibly reading into an unmapped page): [Convert a String In C++ To Upper Case](https://stackoverflow.com/a/37151084) shows that using strlen once outside the loop can actually help, letting `gcc -O3` auto-vectorize. This helps a lot for strings 128 bytes long or longer, not so much for only 16 byte strings. — Peter Cordes, Sep 28 '20 at 21:23

Does moving out a unchanged computation out of a loop improve a program's performance?

2 Answers2

Testprogram

Results

Linked