Here's the function I'm looking at:

template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
  uint64_t val = 0;
  for (uint8_t i = 0; i < Size; ++i)
    if (buf[i] != ' ')
      val = (val * 10) + (buf[i] - '0');
  return val;
}

I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:

I   refs:      7,154,919

When I compile with -O2 I get:

I   refs:      9,001,570

OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:

-fgcse-after-reload                         [enabled]
-finline-functions                          [enabled]
-fipa-cp-clone                              [enabled]
-fpredictive-commoning                      [enabled]
-ftree-loop-distribute-patterns             [enabled]
-ftree-vectorize                            [enabled]
-funswitch-loops                            [enabled]

So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:

I   refs:      9,546,017

I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?

In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):

#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
  for (uint8_t i = Size - 1; i < 255; --i)
  {
    if (buf[i] == ' ')
    {
      buf[i] = '1';
      break;
    }

    ++buf[i];
    if (buf[i] > '9')
      buf[i] -= 10;
    else
      break;
  }
}

int main()
{
  char str[5];
  memset(str, ' ', sizeof(str));

  unsigned max = std::pow(10, sizeof(str));
  for (unsigned ii = 0; ii < max; ++ii)
  {
    uint64_t result = parseUnsigned(str);
    if (result != ii)
    {
      printf("parseUnsigned(%.*s) from %u: %llu\n", (int)sizeof(str), str, ii, (unsigned long long)result);
      abort();
    }
    increment(str);
  }
}
John Zwinck
  • Did you time it, in addition to using callgrind? – Ben Voigt Aug 21 '14 at 02:59
  • If you really want to optimize this function, dispatch based on `Size` so that 64-bit arithmetic is only used when the string is large enough to store numbers above `UINT_MAX`. – Ben Voigt Aug 21 '14 at 03:01
  • Instead of looking at the man page, try using [this command](http://stackoverflow.com/a/5470379/541686) to tell which options are really getting activated. – user541686 Aug 21 '14 at 03:01
  • @BenVoigt: I did now. I changed `sizeof(str)` from 5 to 9. Then `-O2` takes 11 seconds, `-O3` takes 6.8 s, and `-O2 -ftree-vectorize` takes 13 seconds. This is on a 2.7 GHz Xeon E5-2697 running 64-bit Linux. – John Zwinck Aug 21 '14 at 03:02
  • @Mehrdad: I got the options list I posted by running the commands the man page told me to (but with g++ instead of gcc): `g++ -c -Q -O3 --help=optimizers > /tmp/O3-opts; g++ -c -Q -O2 --help=optimizers > /tmp/O2-opts; diff /tmp/O2-opts /tmp/O3-opts | grep enabled`. Do you have something else in mind? – John Zwinck Aug 21 '14 at 03:03
  • @Praetorian: That's exactly how I got the list of additional options in my question. That generated list included one extra option not mentioned by the man page part about `-O3`, by the way. – John Zwinck Aug 21 '14 at 03:05
  • @JohnZwinck: Ohh I misunderstood probably, never mind. – user541686 Aug 21 '14 at 03:07
  • size 10 (9 digits, one terminating NUL) fits in a 30-bit integer. If the compiler is choosing 32-bit math at higher optimization levels, that could account for the difference – Ben Voigt Aug 21 '14 at 03:07
  • @BenVoigt: I tried changing from uint64_t to uint32_t. It reduced the running time for each optimization level by about 1 second, which is nice but not nearly as impactful as going from `-O2` to `-O3`. Thanks for the idea just the same, it's useful even if orthogonal to my question. – John Zwinck Aug 21 '14 at 03:10
  • What happens if you add `-Ofast`? That seems to match `-O3` for me. Edit: I'm stupid, it turns on `-O3`. –  Aug 21 '14 at 03:17
  • @remyabel: yes, `-Ofast` gives the same performance as `-O3`. I don't think this is very surprising. And `-Os` gives similar performance to `-O2 -ftree-vectorize`, which is the worst. – John Zwinck Aug 21 '14 at 03:18
  • https://gcc.gnu.org/wiki/FAQ#optimization-options – OmnipotentEntity Aug 21 '14 at 03:22
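
Ben Voigt's suggestion to dispatch on `Size` could look roughly like the sketch below. It assumes, as in the question, that every one of the `Size` characters is either a space or a digit (no terminating NUL), so any buffer of up to 9 characters holds at most 999,999,999 and fits in 32 bits. The names `Accumulator` and `parseUnsignedDispatch` are hypothetical, not from the original post, and the sketch says nothing about whether it closes the `-O2` / `-O3` gap.

```cpp
#include <stdint.h>

// Hypothetical helper: selects the accumulator type at compile time.
template <bool FitsIn32> struct Accumulator       { typedef uint64_t type; };
template <>              struct Accumulator<true> { typedef uint32_t type; };

template <uint8_t Size>
inline uint64_t parseUnsignedDispatch(const char (&buf)[Size])
{
  // Assumption: up to 9 decimal digits never exceed 999,999,999,
  // so 32-bit arithmetic is enough for Size <= 9.
  typedef typename Accumulator<(Size <= 9)>::type Acc;

  Acc val = 0;
  for (uint8_t i = 0; i < Size; ++i)
    if (buf[i] != ' ')
      val = (val * 10) + static_cast<Acc>(buf[i] - '0');
  return val;  // widened to uint64_t so the interface stays unchanged
}
```

With the question's harness, replacing the call to `parseUnsigned(str)` with `parseUnsignedDispatch(str)` would select the 32-bit path for `char str[5]`.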

1 Answer

A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486

I've copied the relevant text underneath.

UPDATE: There are questions about it in GCC WIKI:

  • "Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"

No. First, individual optimization options (-f*) do not enable optimization, an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.

  • "What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"

Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:

touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s
OmnipotentEntity
  • So you think there is no way to get GCC to optimize this code more effectively without throwing the kitchen sink of `-O3` at it? The difference in running time is not small--11 seconds vs. 6.8 seconds as per my comment on the question. – John Zwinck Aug 21 '14 at 03:28
  • Essentially, there is no optimization level you can choose between `-O2 -faBunchOfStuffThatO3Enables` and `-O3` without editing the source of the compiler according to everything I've found on the matter. – OmnipotentEntity Aug 21 '14 at 03:30
  • You can also use `-O3 -fno-aBunchOfStuff` to narrow the gap. And play with some `--param` (see `--help=param` for a list). – Marc Glisse Aug 21 '14 at 13:35
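
A related avenue that nobody in this thread tried, offered only as a sketch: GCC has a per-function `optimize` attribute that can request a different optimization level for a single function while the rest of the translation unit is built with `-O2`. Whether it reproduces the `-O3` numbers above is an assumption that would have to be measured. It may also change inlining decisions, and the GCC manual describes the attribute as intended for debugging rather than production use. The names `PARSE_OPT_O3` and `parseUnsignedO3` are hypothetical.

```cpp
#include <stdint.h>

// GCC-only attribute; other compilers get an empty macro so the code still builds.
#if defined(__GNUC__) && !defined(__clang__)
#define PARSE_OPT_O3 __attribute__((optimize("O3")))
#else
#define PARSE_OPT_O3
#endif

// Same body as parseUnsigned() from the question; only the attribute is new.
template <uint8_t Size>
PARSE_OPT_O3 inline uint64_t parseUnsignedO3(const char (&buf)[Size])
{
  uint64_t val = 0;
  for (uint8_t i = 0; i < Size; ++i)
    if (buf[i] != ' ')
      val = (val * 10) + (buf[i] - '0');
  return val;
}
```

To compare, the harness would be built with plain `-O2`, calling `parseUnsignedO3(str)` instead of `parseUnsigned(str)`, and rerun under callgrind.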