Why is VC++ unable to optimize an integer wrapper?

Question

In C++, I'm trying to write a wrapper around a 64-bit integer. My expectation is that, if it is written correctly and all methods are inlined, such a wrapper should perform as well as the underlying type. The answer to this question on SO seems to agree with my expectation.

I wrote this code to test my expectation:

class B
{
private:
   uint64_t _v;

public:
   inline B() {};
   inline B(uint64_t v) : _v(v) {};

   inline B& operator=(B rhs) { _v = rhs._v; return *this; };
   inline B& operator+=(B rhs) { _v += rhs._v; return *this; };
   inline operator uint64_t() const { return _v; };
};

int main(int argc, char* argv[])
{
   typedef uint64_t T;
   //typedef B T;
   const unsigned int x = 100000000;

   Utils::CTimer timer;
   timer.start();

   T sum = 0;
   for (unsigned int i = 0; i < 100; ++i)
   {
      for (uint64_t f = 0; f < x; ++f)
      {
         sum += f;
      }
   }

   float time = timer.GetSeconds();

   cout << sum << endl
        << time << " seconds" << endl;

   return 0;
}

When I run this with typedef B T; instead of typedef uint64_t T; the reported times are consistently 10% slower when compiled with VC++. With g++, the performance is the same whether or not I use the wrapper.

Since g++ does it, I guess there is no technical reason why VC++ cannot optimize this correctly. Is there something I could do to make it optimize this?

I have already tried playing with the optimization flags, with no success.

Did you run the code from Visual Studio or from a Windows console? — jpo38, Feb 04 '15 at 13:05
I think I tested both, but I'll need to test again to make sure. Could it make a difference? — Mathieu Pagé, Feb 04 '15 at 13:06
How did you compile? I presume Release but what optimization flags did you use? Also is the g++ code faster or has VC++ already optimized the code? — Panagiotis Kanavos, Feb 04 '15 at 13:06
VC++ can be "hideously" [ ;) ] effective during optimization, eg using SIMD (vector) operations when it can. Summing integers can be vectorized/parallelized by the compiler. Summing wrappers can't — Panagiotis Kanavos, Feb 04 '15 at 13:07
I don't have the exact times with me but g++ versions were faster than VC++ with or without wrapper. — Mathieu Pagé, Feb 04 '15 at 13:14
As @T.C. answered, g++ optimized the loop away entirely. Both benchmarks fail, in the sense that they don't measure the effect of wrapping. On the other hand, they *do* show that wrapping has side effects, ie it prevents parallelization — Panagiotis Kanavos, Feb 04 '15 at 13:15

score 4 · Answer 1 · answered Feb 04 '15 at 13:12

For the record, this is what g++ and clang++'s generated assembly at -O2 translates to (in both wrapper and non-wrapper cases), modulo the timing part:

sum = 499999995000000000;
cout << sum << endl;

In other words, it optimized the loop out entirely. Regardless of how hard you try to vectorize the loop, it's rather hard to beat not looping at all :)
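The constant above can be checked by hand: each of the 100 outer passes adds 0 + 1 + ... + (x - 1) = x(x - 1)/2. Here is a minimal sketch of that arithmetic, assuming the loop bounds from the question. The helper names `closed_form_sum`, `looped_sum` and `check_small_case` are made up for illustration and are not something the compilers emit.

```cpp
#include <cassert>
#include <cstdint>

// Closed form of the question's nested loop: `reps` passes, each summing 0..n-1.
// Hypothetical helper, written only to verify the constant in the answer.
constexpr std::uint64_t closed_form_sum(std::uint64_t reps, std::uint64_t n)
{
    // The sum of 0..n-1 is n*(n-1)/2. One of n and n-1 is even, so the division is exact.
    return reps * (n * (n - 1) / 2);
}

// The same value computed the slow way, mirroring the loops in the question.
inline std::uint64_t looped_sum(std::uint64_t reps, std::uint64_t n)
{
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < reps; ++i)
        for (std::uint64_t f = 0; f < n; ++f)
            sum += f;
    return sum;
}

// Cross-check both versions on a small input where the loop is cheap.
inline void check_small_case()
{
    assert(looped_sum(3, 10) == closed_form_sum(3, 10));
}

// With the question's bounds (100 passes, x = 100000000) the closed form
// gives exactly the constant shown in the answer.
static_assert(closed_form_sum(100, 100000000) == 499999995000000000ULL,
              "matches the constant g++ and clang++ fold the loop into");
```

The intermediate product n*(n-1) is about 1e16 here, so nothing overflows a 64-bit unsigned integer.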

score 3 · Answer 2 · answered Feb 04 '15 at 13:08

Using /O2 (maximize speed), both alternatives generate exactly the same assembly using Visual Studio 2012. This is your code, minus the timing and output:

00FB1000  push        ebp  
00FB1001  mov         ebp,esp  
00FB1003  and         esp,0FFFFFFF8h  
00FB1006  sub         esp,8  
00FB1009  mov         edx,64h  
00FB100E  mov         edi,edi  
00FB1010  xorps       xmm0,xmm0  
00FB1013  movlpd      qword ptr [esp],xmm0  
00FB1018  mov         ecx,dword ptr [esp+4]  
00FB101C  mov         eax,dword ptr [esp]  
00FB101F  nop  
00FB1020  add         eax,1  
00FB1023  adc         ecx,0  
00FB1026  jne         main+2Fh (0FB102Fh)  
00FB1028  cmp         eax,5F5E100h  
00FB102D  jb          main+20h (0FB1020h)  
00FB102F  dec         edx  
00FB1030  jne         main+10h (0FB1010h)  
00FB1032  xor         eax,eax

I'd presume that the measured times fluctuate or are not always correct.
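One way to reduce that kind of fluctuation is to time each variant several times and keep the fastest run. The following is only a sketch under assumptions: it uses `std::chrono::steady_clock` in place of the question's `Utils::CTimer`, and the names `sum_to` and `best_seconds` are invented for this example. The loop bound is read through a `volatile` so it is not a compile-time constant, but a compiler may still replace the loop with a closed form, so checking the generated assembly remains necessary.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// The wrapper from the question, reduced to what the loop needs.
class B
{
    std::uint64_t _v;

public:
    B(std::uint64_t v) : _v(v) {}
    B& operator+=(B rhs) { _v += rhs._v; return *this; }
    operator std::uint64_t() const { return _v; }
};

// Sums 0..n-1 using T as the accumulator type (std::uint64_t or B).
template <typename T>
std::uint64_t sum_to(std::uint64_t n)
{
    T sum = 0;
    for (std::uint64_t f = 0; f < n; ++f)
        sum += f;
    return sum;
}

// Runs sum_to<T> `runs` times and returns the fastest wall-clock time in seconds.
// The last computed sum is stored in `result_out` so the work has a visible use.
template <typename T>
double best_seconds(std::uint64_t n, int runs, std::uint64_t& result_out)
{
    assert(runs > 0);
    volatile std::uint64_t bound = n; // hides the bound from constant folding
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r)
    {
        const auto t0 = std::chrono::steady_clock::now();
        const std::uint64_t res = sum_to<T>(bound);
        const auto t1 = std::chrono::steady_clock::now();
        result_out = res;
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Calling `best_seconds<std::uint64_t>(n, runs, out)` and `best_seconds<B>(n, runs, out)` with the same `n` gives two numbers that can be compared directly. Taking the minimum rather than a single measurement filters out runs disturbed by other activity on the machine.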

Daerst


`xmm0` ! MMX registers! It *did* vectorize the operation! – Panagiotis Kanavos Feb 04 '15 at 13:09
@PanagiotisKanavos Indeed, a rare sight I'd say. – Daerst Feb 04 '15 at 13:12
Not rare actually, VC is only surpassed by Intel's own compilers in parallelizing code. – Panagiotis Kanavos Feb 04 '15 at 13:13
I spent quite some time optimizing a tight loop by hand where Visual Studio would use lots of `mulss`, but never `mulps`, although it was perfectly possible. Made me lose a bit of confidence ;) – Daerst Feb 04 '15 at 13:15
What version? Each successive version has a lot of improvements, and there *were* two major new versions in the last couple of years. Moreover, newer agreements with Intel mean that the latest versions contain *larger* parts of Intel's vectorization technology, parallel libraries etc – Panagiotis Kanavos Feb 04 '15 at 13:17
VS2012, as it was in an ongoing project. I'm gonna check the code in VS2013 and the VS2015 Preview :) – Daerst Feb 04 '15 at 13:19
Also VS2015 - yet another major version coming out :) – Panagiotis Kanavos Feb 04 '15 at 13:19
@T.C. On a closer look the MMX register seems to be used to store the inner `for` loop's counting variable `uint64_t f`, right? – Daerst Feb 04 '15 at 13:48
