
The following is a simple loop in C++. The timer uses QueryPerformanceCounter() and is quite accurate. I found Java to take only 60% of the time C++ takes, and that can't be right?! What am I doing wrong here? Even strict aliasing (not shown in the code here) doesn't help at all...

// Requires <array> and <cstdio>; CHighPrecisionTimer wraps QueryPerformanceCounter().
long long var = 0;
std::array<int, 1024> arr;
int* arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

timer.Start();

for(int i = 0; i < 1024 * 1024 * 10; i++){
    for(int x = 0; x < 1024; x++){
        var += arrPtr[x];
    }
}

timer.Stop();

printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);

This C++ code runs in about 9.5 seconds. I am using the Intel Compiler 12.1 with host-processor optimization (targeting my CPU specifically) and everything maxed out, so this is the Intel Compiler at its best! Funnily enough, auto-parallelization consumes 70% CPU instead of 25%, but doesn't get the job done any faster ;)...

Now I use the following Java code for comparison:

    long var = 0;
    int[] arr = new int[1024];

    for(int i = 0; i < 1024; i++) arr[i] = i;

    for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    long nanos = System.nanoTime();

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    nanos = (System.nanoTime() - nanos) / 1000;

    System.out.print("Value: " + var + ", Time: " + nanos);

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (only uses one thread).

Is this a failure of the Intel Compiler or am I just too dumb again?

[EDIT]: Ok, now here's the thing... It seems more like a bug in the Intel compiler ^^. [Please note that I am running on an Intel Core 2 Quad Q6600, which is rather old. It might be that the Intel Compiler performs much better on recent CPUs, like the Core i7.]

Intel x86 (without vectorization): 3 seconds
MSVC x64: 5 seconds
Java x86/x64 (Oracle Java 7): 7 seconds
Intel x64 (with vectorization): 9.5 seconds
Intel x86 (with vectorization): 9.5 seconds
Intel x64 (without vectorization): 12 seconds
MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression:

#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <vector>
#include <boost/function.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/typeof/typeof.hpp>

template<class TValue>
struct ArrayList
{
private:
    std::vector<TValue> m_Entries;
public:

    template<class TCallback>
    void Foreach(TCallback inCallback)
    {
        for(int i = 0, size = m_Entries.size(); i < size; i++)
        {
            inCallback(i);
        }
    }

    void Add(TValue inValue)
    {
        m_Entries.push_back(inValue);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    auto t = [&]() {};


    ArrayList<int> arr;
    int res = 0;

    for(int i = 0; i < 100; i++)
    {
        arr.Add(i);
    }

    long long freq, t1, t2;

    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);

    for(int i = 0; i < 1000 * 1000 * 10; i++)
    {
        arr.Foreach([&](int v) {
            res += i;
        });
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);

    printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

    if(res == 4950)
        return -1;

    return 0;
}

Intel compiler shines again:

MSVC x86/x64: 12 milliseconds
Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies: based on an answer to this thread, the Intel compiler is known (and I knew that too, but I just didn't think they would drop support for their own processors) to have terrible performance on processors which are not "known" to the compiler, like AMD processors, and maybe even outdated Intel processors like mine... So if someone with a recent Intel processor could try this out, that would be nice ;).

Here is the x64 output of the Intel Compiler:

    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013F05101D  lea         rcx,[freq]  
000000013F051022  call        qword ptr [__imp_QueryPerformanceFrequency (13F052000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013F051028  mov         eax,4  
000000013F05102D  movd        xmm0,eax  
000000013F051031  xor         eax,eax  
000000013F051033  pshufd      xmm1,xmm0,0  
000000013F051038  movdqa      xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]  
000000013F051040  movdqa      xmmword ptr arr[rax*4],xmm0  
000000013F051046  paddd       xmm0,xmm1  
000000013F05104A  movdqa      xmmword ptr [rsp+rax*4+60h],xmm0  
000000013F051050  paddd       xmm0,xmm1  
000000013F051054  movdqa      xmmword ptr [rsp+rax*4+70h],xmm0  
000000013F05105A  paddd       xmm0,xmm1  
000000013F05105E  movdqa      xmmword ptr [rsp+rax*4+80h],xmm0  
000000013F051067  add         rax,10h  
000000013F05106B  paddd       xmm0,xmm1  
000000013F05106F  cmp         rax,400h  
000000013F051075  jb          wmain+40h (13F051040h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013F051077  lea         rcx,[t1]  
000000013F05107C  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  
            var += arrPtr[x];
000000013F051082  movdqa      xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F05108A  xor         eax,eax  
            var += arrPtr[x];
000000013F05108C  movdqa      xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)]  
    long long var = 0, freq, t1, t2;
000000013F051094  pxor        xmm6,xmm6  
        for(int x = 0; x < 1024; x++){
000000013F051098  xor         r8d,r8d  
            var += arrPtr[x];
000000013F05109B  lea         rdx,[arr]  
000000013F0510A0  xor         ecx,ecx  
000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
        for(int x = 0; x < 1024; x++){
000000013F0510A8  add         r8,8  
            var += arrPtr[x];
000000013F0510AC  punpckldq   xmm2,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F0510B0  add         rcx,20h  
            var += arrPtr[x];
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  
000000013F0510EC  pand        xmm4,xmm0  
000000013F0510F0  movdqa      xmm3,xmm2  
000000013F0510F4  por         xmm5,xmm4  
000000013F0510F8  psrad       xmm3,1Fh  
000000013F0510FD  movq        xmm4,mmword ptr [rdx+18h]  
000000013F051102  pand        xmm3,xmm1  
000000013F051106  punpckldq   xmm4,xmm4  
000000013F05110A  pand        xmm2,xmm0  
000000013F05110E  por         xmm3,xmm2  
000000013F051112  movdqa      xmm2,xmm4  
000000013F051116  paddq       xmm6,xmm5  
000000013F05111A  psrad       xmm2,1Fh  
000000013F05111F  pand        xmm4,xmm0  
000000013F051123  pand        xmm2,xmm1  
        for(int x = 0; x < 1024; x++){
000000013F051127  add         rdx,20h  
            var += arrPtr[x];
000000013F05112B  paddq       xmm6,xmm3  
000000013F05112F  por         xmm2,xmm4  
        for(int x = 0; x < 1024; x++){
000000013F051133  cmp         r8,400h  
            var += arrPtr[x];
000000013F05113A  paddq       xmm6,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F05113E  jb          wmain+0A2h (13F0510A2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F051144  inc         eax  
000000013F051146  cmp         eax,0A00000h  
000000013F05114B  jb          wmain+98h (13F051098h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013F051151  lea         rcx,[t2]  
000000013F051156  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F05115C  mov         r9,qword ptr [t2]  
    long long var = 0, freq, t1, t2;
000000013F051161  movdqa      xmm0,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051165  sub         r9,qword ptr [t1]  
000000013F05116A  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]  
000000013F051171  imul        rax,r9,3E8h  
000000013F051178  cqo  
000000013F05117A  mov         r10,qword ptr [freq]  
000000013F05117F  idiv        rax,r10  
    long long var = 0, freq, t1, t2;
000000013F051182  psrldq      xmm0,8  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051187  mov         rdx,rax  
    long long var = 0, freq, t1, t2;
000000013F05118A  paddq       xmm6,xmm0  
000000013F05118E  movd        r8,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051193  call        qword ptr [__imp_printf (13F052108h)]  

And here is the assembly of the MSVC x64 build:

int _tmain(int argc, _TCHAR* argv[])
{
000000013FF61000  push        rbx  
000000013FF61002  mov         eax,1050h  
000000013FF61007  call        __chkstk (13FF61950h)  
000000013FF6100C  sub         rsp,rax  
000000013FF6100F  mov         rax,qword ptr [__security_cookie (13FF63000h)]  
000000013FF61016  xor         rax,rsp  
000000013FF61019  mov         qword ptr [rsp+1040h],rax  
    long long var = 0, freq, t1, t2;
    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FF61021  lea         rcx,[rsp+28h]  
000000013FF61026  xor         ebx,ebx  
000000013FF61028  call        qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FF6102E  xor         r11d,r11d  
000000013FF61031  lea         rax,[rsp+40h]  
000000013FF61036  mov         dword ptr [rax],r11d  
000000013FF61039  inc         r11d  
000000013FF6103C  add         rax,4  
000000013FF61040  cmp         r11d,400h  
000000013FF61047  jl          wmain+36h (13FF61036h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FF61049  lea         rcx,[rsp+20h]  
000000013FF6104E  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  
000000013FF61054  mov         r11d,0A00000h  
000000013FF6105A  nop         word ptr [rax+rax]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF61060  xor         edx,edx  
000000013FF61062  xor         r8d,r8d  
000000013FF61065  lea         rcx,[rsp+48h]  
000000013FF6106A  xor         r9d,r9d  
000000013FF6106D  mov         r10d,100h  
000000013FF61073  nop         word ptr [rax+rax]  
            var += arrPtr[x];
000000013FF61080  movsxd      rax,dword ptr [rcx-8]  
000000013FF61084  add         rcx,10h  
000000013FF61088  add         rbx,rax  
000000013FF6108B  movsxd      rax,dword ptr [rcx-14h]  
000000013FF6108F  add         r9,rax  
000000013FF61092  movsxd      rax,dword ptr [rcx-10h]  
000000013FF61096  add         r8,rax  
000000013FF61099  movsxd      rax,dword ptr [rcx-0Ch]  
000000013FF6109D  add         rdx,rax  
000000013FF610A0  dec         r10  
000000013FF610A3  jne         wmain+80h (13FF61080h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF610A5  lea         rax,[rdx+r8]  
000000013FF610A9  add         rax,r9  
000000013FF610AC  add         rbx,rax  
000000013FF610AF  dec         r11  
000000013FF610B2  jne         wmain+60h (13FF61060h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FF610B4  lea         rcx,[rsp+30h]  
000000013FF610B9  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013FF610BF  mov         rax,qword ptr [rsp+30h]  
000000013FF610C4  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13FF621B0h)]  
000000013FF610CB  sub         rax,qword ptr [rsp+20h]  
000000013FF610D0  mov         r8,rbx  
000000013FF610D3  imul        rax,rax,3E8h  
000000013FF610DA  cqo  
000000013FF610DC  idiv        rax,qword ptr [rsp+28h]  
000000013FF610E1  mov         rdx,rax  
000000013FF610E4  call        qword ptr [__imp_printf (13FF62138h)]  

    return 0;
000000013FF610EA  xor         eax,eax  

Intel Compiler configured without vectorization, 64-bit, highest optimizations (this is surprisingly slow, 12 seconds):

000000013FC0102F  lea         rcx,[freq]  

    double var = 0; long long freq, t1, t2;
000000013FC01034  xorps       xmm6,xmm6  
    std::array<double, 1024> arr;
    double* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FC01037  call        qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FC0103D  mov         eax,2  
000000013FC01042  mov         rdx,100000000h  
000000013FC0104C  movd        xmm0,eax  
000000013FC01050  xor         eax,eax  
000000013FC01052  pshufd      xmm1,xmm0,0  
000000013FC01057  movd        xmm0,rdx  
000000013FC0105C  nop         dword ptr [rax]  
000000013FC01060  cvtdq2pd    xmm2,xmm0  
000000013FC01064  paddd       xmm0,xmm1  
000000013FC01068  cvtdq2pd    xmm3,xmm0  
000000013FC0106C  paddd       xmm0,xmm1  
000000013FC01070  cvtdq2pd    xmm4,xmm0  
000000013FC01074  paddd       xmm0,xmm1  
000000013FC01078  cvtdq2pd    xmm5,xmm0  
000000013FC0107C  movaps      xmmword ptr arr[rax*8],xmm2  
000000013FC01081  paddd       xmm0,xmm1  
000000013FC01085  movaps      xmmword ptr [rsp+rax*8+60h],xmm3  
000000013FC0108A  movaps      xmmword ptr [rsp+rax*8+70h],xmm4  
000000013FC0108F  movaps      xmmword ptr [rsp+rax*8+80h],xmm5  
000000013FC01097  add         rax,8  
000000013FC0109B  cmp         rax,400h  
000000013FC010A1  jb          wmain+60h (13FC01060h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FC010A3  lea         rcx,[t1]  
000000013FC010A8  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010AE  xor         eax,eax  
        for(int x = 0; x < 1024; x++){
000000013FC010B0  xor         edx,edx  
            var += arrPtr[x];
000000013FC010B2  lea         ecx,[rdx+rdx]  
        for(int x = 0; x < 1024; x++){
000000013FC010B5  inc         edx  
        for(int x = 0; x < 1024; x++){
000000013FC010B7  cmp         edx,200h  
            var += arrPtr[x];
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
        for(int x = 0; x < 1024; x++){
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010CB  inc         eax  
000000013FC010CD  cmp         eax,0A00000h  
000000013FC010D2  jb          wmain+0B0h (13FC010B0h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FC010D4  lea         rcx,[t2]  
000000013FC010D9  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

Intel Compiler without vectorization, 32-bit, highest optimization (this one is clearly the winner now; it runs in about 3 seconds and the assembly looks much better):

00B81088  lea         eax,[t1]  
00B8108C  push        eax  
00B8108D  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B81093  xor         eax,eax  
00B81095  pxor        xmm0,xmm0  
00B81099  movaps      xmm1,xmm0  
        for(int x = 0; x < 1024; x++){
00B8109C  xor         edx,edx  
            var += arrPtr[x];
00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
        for(int x = 0; x < 1024; x++){
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
00B810C1  inc         eax  
00B810C2  cmp         eax,0A00000h  
00B810C7  jb          wmain+9Ch (0B8109Ch)  

    double var = 0; long long freq, t1, t2;
00B810C9  addpd       xmm0,xmm1  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
00B810CD  lea         eax,[t2]  
00B810D1  push        eax  
00B810D2  movaps      xmmword ptr [esp+4],xmm0  
00B810D7  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B810DD  movaps      xmm0,xmmword ptr [esp]
thesaint
  • I'm no expert in either language but you are comparing a straight array with a std::array class. Is it that surprising the straight array would be faster? – Mitch Wheat Jan 19 '12 at 01:15
  • Ok, I can post it here ;). Well, I could look at the Java disassembly, but it is a little bit more complicated... The point is that I don't understand anything of what the Intel Compiler emits. It's very fancy... @Wheat: No, the std::array only allocates. As you can see, it gets cast to a pointer... – thesaint Jan 19 '12 at 01:16
  • @Greg Hewgill it is actually an interesting question: how to peek at Java's jitted code. – Andrey Jan 19 '12 at 01:16
  • @Mitch: I would imagine that `std::array` is a very thin wrapper around a native array. – Oliver Charlesworth Jan 19 '12 at 01:16
  • I'm not sure this would make a major difference, but is your processor 32 bit or 64 bit? In C++ the size of an int is platform-dependent, but Java defines an int to be 32 bits. – Chris Pitman Jan 19 '12 at 01:17
  • @Chris: I would expect an `int` to be 32-bit in a 64-bit compiler. – Oliver Charlesworth Jan 19 '12 at 01:19
  • Isn't there something like that in .NET? In .NET you can use the Show Disassembly feature of Visual Studio, if I remember correctly. Maybe I will try that with C# tomorrow... Good point, I will try 64-bit again. – thesaint Jan 19 '12 at 01:20
  • @Chris: That appears to be exactly it. The C++ code is compiled with `int` being 64 bits. @thesaint: Try printing `sizeof(int)` to see what kind of integers you're using. – Greg Hewgill Jan 19 '12 at 01:22
  • @GregHewgill What indicates that `int` is 64 bits in the C++? – Daniel Fischer Jan 19 '12 at 01:29
  • @DanielFischer The size of int is not defined by the standard; usually int is the numeric type most native to the current system. – Andrey Jan 19 '12 at 01:35
  • Maybe the Java Compiler knows Gauss' equation? :-) Actually, if this were true the Java time would be far less. – user949300 Jan 19 '12 at 01:36
  • Use `<cstdint>`'s `std::int32_t` to match Java's `int`. – GManNickG Jan 19 '12 at 01:38
  • @Java Compiler knows Gauss: Actually I kinda wonder why they fail to optimize this loop away ;)... – thesaint Jan 19 '12 at 01:38
  • @Greg: Okay I will resolve your confusion. I removed it because I thought it would be useless any way. But I will put it back ;). Additionally with the MSVC output. sizeof(int) is going to be 4 bytes on x86 and as far as I know 8 bytes on x64 (but after all "int" is meant to be the fastest anyway, by the standard)... – thesaint Jan 19 '12 at 01:43
  • You've posted an assembly listing from the Intel compiler and then edited it away, so I can't quote from it. But it seems that this assembly relied heavily on the _128 bit_ operations with `xmm` registers. This might be the cause of the low performance. – Pavel Zhuravlev Jan 19 '12 at 01:45
  • @thesaint: `int` is *not* defined to be fastest by the standard, it's defined to be *native*. For simple operations this may be fastest, sure, but fastest depends on context. In this context, doubling the size of the data type doubles the amount of data you have to process, certainly not making it any faster. – GManNickG Jan 19 '12 at 01:46
  • @thesaint: don't just guess, **print the value of `sizeof(int)` out**. – Greg Hewgill Jan 19 '12 at 01:47
  • @Pavel: It is now back... Well, maybe so ;), but then it is a failure of the Intel Compiler, because it should generate fast code, not slow code. But maybe it is optimized for recent processors; I am on a Q6600, which might be too old ^^. But hell, it hasn't met any task yet which would require a replacement... – thesaint Jan 19 '12 at 01:47
  • Any way some of this could be cleared up, or taken to chat perhaps (assuming all the parties involved have enough rep)? – casperOne Jan 19 '12 at 01:49
  • I'm voting to close this question as **not constructive**. This isn't a "C++ vs. Java" question, it's confusion about which architecture you're actually compiling for. This kind of fabricated "benchmark" is never useful. – Greg Hewgill Jan 19 '12 at 01:49
  • @Greg: Well you were right. It is 4 bytes in either case :(. – thesaint Jan 19 '12 at 01:50
  • @Greg: I don't see the confusion. Java on 32-bit is still way faster... and beats the heck out of the Intel compiler in either case. I found this pretty interesting; maybe you don't. But that sure does not apply to everyone. – thesaint Jan 19 '12 at 01:51
  • @Andrey Not as far as I know. `int` is 32 bits on 64-bit Linux and Windows, to my knowledge. – Daniel Fischer Jan 19 '12 at 01:51
  • @casperOne: Or just leave it for a bit. I'm not sure what point there is for a comment section if they continually get deleted and moved around. Just close your eyes and wait until we resolve the issue and clarify the question, then come in and enforce any arbitrary rules you need. – GManNickG Jan 19 '12 at 01:53
  • Yeah, or maybe you should just change the title appropriately. I couldn't know upfront what I was getting at... – thesaint Jan 19 '12 at 01:55
  • @Gman There is no indication that anything is or will be done, it's a request for the parties involved to be responsible and clean up after themselves if the comments are no longer relevant or if they are being used for the wrong reason. If the comments are being used for *extended discussion*, then they are being used for the wrong reason. For extended discussion, parties should go to [chat](http://chat.stackoverflow.com) where the discussion is public and transcribed and generally not deleted unless there's a seriously offensive message. – casperOne Jan 19 '12 at 01:59
  • This is interesting: Intel does SSE2 and MSVC does plain x64... and MSVC wins. Maybe the Java JIT is just dumb enough not to do the SSE stuff. – J-16 SDiZ Jan 19 '12 at 02:06
  • @SDiz: Well, I really think it is rather a problem with processor support for the Intel compiler, but I can't validate that statement though. But it just wouldn't be right if it were so slow... – thesaint Jan 19 '12 at 02:13
  • Question: Which Java version did you use? – RokL Jan 19 '12 at 12:59

7 Answers


tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64:

Here's the critical loop:

$LL3@main:
movsxd  rax, DWORD PTR [rdx-4]
movsxd  rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd  rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd  rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main

What you see here is standard loop unrolling by the compiler. MSVC unrolls by 4 iterations and splits the `var` accumulation across four registers: r10, rbx, r9, and r8. Then, at the end of the loop, these 4 registers are summed back together.

Here's where the 4 sums are recombined:

lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main

Note that MSVC currently does not do automatic vectorization.
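In C++ terms, the transformation is roughly the following (a hand-written sketch of the idea, not the compiler's literal output):

long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;

// Four independent accumulators break the dependency chain on a single
// register, so the four additions per iteration can overlap in the pipeline.
for (int x = 0; x < 1024; x += 4) {
    s0 += arrPtr[x + 0];
    s1 += arrPtr[x + 1];
    s2 += arrPtr[x + 2];
    s3 += arrPtr[x + 3];
}

long long var = s0 + s1 + s2 + s3;  // recombined, as in the lea/add sequence above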


Now let's look at part of your ICC output:

000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
000000013F0510A8  add         r8,8  
000000013F0510AC  punpckldq   xmm2,xmm2  
000000013F0510B0  add         rcx,20h  
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  

...

What you're seeing here is an attempt by ICC to vectorize this loop. It is done in a similar manner to what MSVC did (splitting into multiple sums), but using SSE registers instead, with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one-by-one, we can see how ICC tries to vectorize it:

//  Load two ints using a 64-bit load.  {x, y, 0, 0}
movq        xmm2,mmword ptr arr[rcx]  

//  Shuffle the data into this form.
punpckldq   xmm2,xmm2           xmm2 = {x, x, y, y}
movdqa      xmm3,xmm2           xmm3 = {x, x, y, y}

//  Mask out index 1 and 3.
pand        xmm2,xmm0           xmm2 = {x, 0, y, 0}

//  Arithmetic right-shift to copy sign-bit across the word.
psrad       xmm3,1Fh            xmm3 = {sign(x), sign(x), sign(y), sign(y)}

//  Mask out index 0 and 2.
pand        xmm3,xmm1           xmm3 = {0, sign(x), 0, sign(y)}

//  Combine to get sign-extended values.
por         xmm3,xmm2           xmm3 = {x, sign(x), y, sign(y)}
                                xmm3 = {x, y}

//  Add to accumulator...
paddq       xmm6,xmm3

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.
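For illustration, here are the two approaches as SSE intrinsics (a sketch only; the SSE2 variant uses an unpack with the sign word instead of ICC's mask-and-or sequence, but it computes the same thing):

#include <emmintrin.h>  // SSE2
#include <smmintrin.h>  // SSE4.1

// Sign-extend two packed 32-bit ints to two 64-bit ints using SSE2 only.
__m128i widen_sse2(__m128i v)                 // v = {x, y, ?, ?}
{
    __m128i sign = _mm_srai_epi32(v, 31);     // {sign(x), sign(y), ?, ?}
    return _mm_unpacklo_epi32(v, sign);       // {x, sign(x), y, sign(y)} = two int64s
}

// The same thing on SSE4.1: a single PMOVSXDQ.
__m128i widen_sse41(__m128i v)
{
    return _mm_cvtepi32_epi64(v);
}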

But the point is:

The Intel compiler is trying to vectorize the loop. But the overhead added seems to outweigh the benefit of vectorizing it in the first place. Hence it's slower.


EDIT: Updated with the OP's results for:

  • ICC x64 no vectorization
  • ICC x86 with vectorization

You changed the data type to double, so now it's floating-point. There are no more of those ugly sign-extension shifts that were plaguing the integer version.

But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

Not much here - standard vectorization + 4x loop-unrolling.

ICC x64 with no vectorization:

000000013FC010B2  lea         ecx,[rdx+rdx]  
000000013FC010B5  inc         edx  
000000013FC010B7  cmp         edx,200h  
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

No vectorization + only 2x loop-unrolling.

All things equal, disabling vectorization will hurt performance in this floating-point case.

Mysticial
  • +1 for this analysis ;). But then the question still remains: why doesn't the Intel Compiler know it's slower? I mean, we are not talking about a few %. This is a huge difference of almost 100% compared to MSVC! – thesaint Jan 19 '12 at 02:21
  • +1, trying ICC without the optimizations may yield surprising results. – J.N. Jan 19 '12 at 02:22
  • @thesaint Compilers aren't perfect. Although they may outdo a human 99% of the time, they are still written by humans, so that doesn't mean they always make the best optimization choices. I've seen numerous cases where compilers (ICC, GCC, MSVC) have done some ***very stupid*** things that I can't explain. – Mysticial Jan 19 '12 at 02:26
  • Ok, but I just thought that at least for such simple examples they should produce the best (correct) answers... – thesaint Jan 19 '12 at 02:35
  • I just tested this in ICC 11, and it's actually faster. What's going on here is that ICC 12 has a new (and powerful) transformation added to aid vectorization. It's pretty clear now that I've taken a look at the assemblies. It does things that I would normally have to do manually. Since this is new, it's possible that Intel hasn't finished tuning this feature. So you've hit a hiccup in the new optimization feature. – Mysticial Jan 19 '12 at 02:38
  • I now disabled vectorization. The x64 build is now much slower but the x86 build wins the race (Intel Compiler 12.1). But this rather seems like a bug. – thesaint Jan 19 '12 at 09:13
  • (If nothing surprising happens in the next day, I will mark yours as the answer) – thesaint Jan 19 '12 at 09:44
  • Answer updated with your new results. Apparently you changed the data-type to `double`. So disabling vectorization in this case severely hurts performance. – Mysticial Jan 19 '12 at 14:57
  • Oh sorry.. I was just playing around with other data types yesterday and seem to have forgotten to switch it back :(. But as far as I tried it, the timings were exactly the same for the Intel Compiler... I don't know where you get that it's slower, because it actually isn't, at least not here. I know it's counterintuitive, but a speedup would obviously require the Intel Compiler to do the optimization correctly, and it seems to have trouble with this one. – thesaint Jan 19 '12 at 16:06

The example is simple enough that the different languages should not make a difference, and stupid enough not to prove anything. The loop could be optimised away by a compiler into a simple assignment, left running for the whole number of iterations, or partially unrolled... I am not sure why you decided to write that test program, but it does not test anything about the languages: once the logic optimisations are performed, it all boils down to exactly the same assembly.
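To make "optimised away into a simple assignment" concrete: the whole benchmark folds to a compile-time constant via Gauss' formula (a hand-folded sketch; neither compiler was observed to actually do this here):

// Each inner loop adds 0 + 1 + ... + 1023 = 1023 * 1024 / 2 = 523776,
// and the outer loop runs 10 * 1024 * 1024 times.
long long var = 10LL * 1024 * 1024 * (1023LL * 1024 / 2);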

Also, the performance of the Intel compiler depends greatly on the exact hardware and the compiler version. The compiler generates different versions of the code, and it has a tendency to generate horrible code for AMD processors. Even on Intel, if it does not recognise the specific processor it falls back to a safe, slow code path.

David Rodríguez - dribeas
  • That's a good point. I suppose Intel has dropped support for the Q6600 in their latest compiler. It just makes no sense otherwise. But maybe someone with a recent Intel processor could check this out?! – thesaint Jan 19 '12 at 02:07
  • +1 for the most accurate answer: The benchmark in question is so silly as to be irrelevant. – Puppy Jan 19 '12 at 02:12
  • @DeadMG: I disagree. If a compiler is not able to generate optimized code for trivial examples, how could it for complicated ones? I mean, there is nothing to it. It's quite the opposite: Java especially shines at code where there are no objects involved. As soon as things get nasty, Java breaks down. The same goes for other compilers. You can't master the complicated if you fail at the basics ^^. – thesaint Jan 19 '12 at 02:30
  • @thesaint you should aim to optimize your program and code patterns similar to your program. In your program the code used to instrument your measurement is greater than the code that you are instrumenting. The cost of the loop is greater than the cost of the add operation, and if the compiler, for example, runs the inner loop and then transforms the outer loop into a multiplication, it will generate very fast code for this particular example, but that will not guarantee that it will be effective in any other case, not even if the number of iterations changes. Google loop unrolling if you want. – David Rodríguez - dribeas Jan 19 '12 at 02:39
  • Also consider that with a simple assembly instruction there is not much to optimize (other than the loop) but in real code the compiler can detect patterns that you did not notice and rewrite your code, or use different combinations of instructions in assembly that yield the same result more efficiently. Unless your particular problem is adding numbers then the test case is basically useless – David Rodríguez - dribeas Jan 19 '12 at 02:55
  • @thesaint: No, because no real program will have such bottlenecks, so why would compilers write optimization passes for such things? – Puppy Jan 19 '12 at 03:17
  • @DeadMG: Well, the point is that this requires no special-case optimization. This loop is so damn simple that the compiler can shine by using loop unrolling, automatic vectorization, parallelization and the like. And if all that fails for these three loop variables without much logic involved, then it would be quite surprising and illogical if it worked any better for even more variables and more complicated logic. The compiler doesn't need to be specially tuned for this. This is a common pattern, which compilers need to be able to detect and handle. That's it... – thesaint Jan 19 '12 at 08:53
  • Sure, no usual program looks like that. But apart from the unusually high iteration count, this is a regular loop with nothing "artificial" about it. – thesaint Jan 19 '12 at 08:54
  • It might be interesting to try GCC, which in my experience often generates faster code in practice despite ICC's formidable reputation. – Eamon Nerbonne May 18 '13 at 15:08

when you have eliminated the impossible, whatever remains, however improbable, must be the truth.

You've got some data in one hand, and an assumption (C++ is always faster than Java) in the other. Why ask for people to justify your assumption when the data tells you otherwise?

If you wish to obtain assembly from the JVM in order to compare what's being run, the command-line option is '-XX:+PrintOptoAssembly', but you'll need to download a debug JVM to use it. Looking at the assembly would at least tell you why one is faster than the other.
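A hypothetical invocation (the class name is made up; and, as noted, this only works on a debug/fastdebug build of HotSpot):

java -server -XX:+PrintOptoAssembly Benchmark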

Richard Warburton
  • People are so insistent that Java is slow, but it hasn't been significantly different since they invented the JIT a dozen years ago. – Karl Jan 19 '12 at 01:27
  • +1 for Sherlock Holmes quote. – blahman Jan 19 '12 at 01:27
  • It's not that Java is supposed to be slow ;). We are talking about Java being 40% faster, and that is what can't be right! At most it should be as fast as C++, and even that would be surprising; in fact, it isn't (see my edit above in a few seconds) – thesaint Jan 19 '12 at 01:30
  • +1 for '-XX:+PrintOptoAssembly – thesaint Jan 19 '12 at 01:41
  • MSVC x64 is faster than Java. On this occasion the results say more about compilers than about languages. – Tim Cooper Jan 19 '12 at 02:03
  • @Karl: This trivial example proves or disproves absolutely nothing. – Puppy Jan 19 '12 at 02:12
  • @Cooper: Well, of course we are not talking about the syntax here. You could natively compile Java just as you could compile C++ to a VM (like C++.NET)... The only thing that matters is the maximum performance you can get with the tools of choice. So if MSVC produces faster code on 64-bit, then Java is slower; but thumbs up that it is that fast at all ^^. I don't see the point. – thesaint Jan 19 '12 at 02:15
  • @DeadMG: Well yes ^^. But by experience I know that Java doesn't get faster if you get to the fun stuff. Oh I think you meant it that way anyway. – thesaint Jan 19 '12 at 02:16
  • @Karl: Java is plenty fast for computational benchmarks like this. If only the JVM itself didn't have to be loaded from disk :( – Billy ONeal Jan 19 '12 at 03:15
  • @TimCooper - these benchmarks nearly always tell you more about compilers than languages. The poster doesn't seem to realise this, and refers to 'Java', whereas in reality he should be referring to HotSpot or whatever the implementation is. – Richard Warburton Jan 19 '12 at 13:27

Just for the record, I ran both codes on my box (x86_64 Linux): the C++ with std::array, with a plain int[1024], and, for completeness, also with long instead of int. Java (OpenJDK 1.6) clocked in at 3.8s, C++ (int) at 3.37s, and C++ (long) at 3.9s. My compiler was g++ 4.5.1. Maybe it's just Intel's compiler that's not as good as thought.
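(The exact flags aren't stated here; a typical invocation for this kind of test would be something like the following.)

g++ -O3 -march=native -o bench bench.cpp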

Daniel Fischer
  • Nice to know that Linux has caught up ;). Last time I compared GCC with MSVC (quite some years ago) it was nowhere near that performance... – thesaint Jan 19 '12 at 02:09
  • @thesaint: Actually I find it more surprising that MSVC does so well, because just about everything I could find about that topic indicates that MSVC is much worse at optimizing code than gcc and icc (between those two, gcc seems to win by a small margin lately). Of course, when it comes to the performance of small samples there can be huge differences if one compiler misses one specific optimization or doesn't know enough about the timings on the target platform – Grizzly Jan 19 '12 at 02:19
  • @thesaint : Daniel didn't compare MSVC on *his* machine; it *may* be faster than GCC. Like other comments mention, that example is too simple to prove anything. Maybe you should try a more complex algorithm and more constraints (e.g., objects...) – J.N. Jan 19 '12 at 02:20
  • I will sure do that later. This was just some simple test for something totally different (designing a cache-aware object container) and if the Intel Compiler would not have performed so terribly slow, I would never have started this thread ;) – thesaint Jan 19 '12 at 02:27
  • @Grizzly Speaking of MSVC doing unusually well at optimizations... This might be an interesting read: http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle – Mysticial Jan 19 '12 at 02:41

I think the Java compiler implements JIT compilation (just-in-time compilation, or some more recent technology) to approach native compilers' speed, and it could infer that your array doesn't change, and thus apply constant folding to the inner loop...

CapelliC
  • The Intel compiler knows all that too. And JIT compilation is overrated, since many people seem to forget that C++ has link-time code generation as well as profile-guided optimization, which in fact do the job as well as any conceivable runtime these days... – thesaint Jan 19 '12 at 01:40
  • @thesaint: Not quite, JIT can do things like inlining virtual calls that no amount of compile-time optimization can do. It requires run-time compilation in principle. (This is of course irrelevant to the question.) – GManNickG Jan 19 '12 at 01:44
  • @GMan: I beg to differ. Profile-Guided-Optimization does exactly this. It will "see" which object instances are likely for every virtual call and emit proper code that reflects this knowledge. Just like any VM. – thesaint Jan 19 '12 at 01:49
  • @thesaint: False. There's simply no way for this code to be optimized at compile-time to be faster than a compilation at run-time: `Base* x = load_some_dynamic_thing(); for (;;) { x->dynamic_dispatch(); }`. No matter how good your compile-time optimization is, you'll never get rid of that virtual function call, because the compiler cannot guess what the run-time is going to do (consider dynamic linking and whatnot). Contrarily, a VM knows exactly what it's going to do because it's running it. It can inline the code and recompile it. – GManNickG Jan 19 '12 at 01:58
  • I don't think it makes much sense to argue about that. Any modern C++ compiler with profile-guided optimization will instrument the suspicious "x->dynamic()" call and see (or maybe not, but then the Java VM is out of luck too) that the object pointed to by "x" is most of the time of some type "T". Then it will inline some branch prediction, or let's say a predicted "hot path", that assumes the object type of "x" is "T". And the virtual call penalty only applies if x is of some type other than "T". – thesaint Jan 19 '12 at 02:03
  • What I agree with, though, is that if you are loading plugins at runtime which were not included in the PGO run, then the C++ compiler is out of luck while Java could still apply these optimizations. But for any static application with a well-defined usage pattern, Java has no advantage. – thesaint Jan 19 '12 at 02:05
  • @GMan: It *can* do that. In theory. But every cycle and byte the JIT spends, the program can't spend, meaning that in reality, the C++ compiler can perform far more optimization (*in general*) since it has effectively infinite time and space in which to run. In addition, many of Java's language semantics simply enforce that it must be slower- for example, member variables being pointers instead of allocated directly. The reality is that these factors far outweigh the minimal advantages the JIT has in inlining virtual calls. – Puppy Jan 19 '12 at 02:08
  • An example to support GMan's standpoint: consider a game with three difficulty levels, and an implementation that for each difficulty level/enemy type produces an object of a different type conforming to an interface. At compile time you cannot know what difficulty level will be chosen, but once a game starts the JIT sees that only `EasyBeast` objects are in use. Again, profile-based optimization cannot optimize that either, as the developers would profile games at different difficulty levels. – David Rodríguez - dribeas Jan 19 '12 at 02:14
  • @David: Good point. But this is all not so productive, since usually you have some hotspots in your app and the rest doesn't matter. And in hotspots you can simply eliminate such bottlenecks. In Java you wouldn't have to, and that is an advantage. But the price is too high nowadays. But I am sure that in the future, VMs will be the way to go! – thesaint Jan 19 '12 at 02:19
  • @DeadMG the language does not require that all members are handled by reference, or even that all dynamically allocated objects live on the heap. Microsoft's C# compiler (I don't know this particular detail for Java compilers) will allocate reference types on the stack if escape analysis guarantees that there will be no reference other than the locals. There is a lot of work by very smart people on optimizers honoring what is known as the *as-if* rule in C++. Languages define the semantics, not the implementations. – David Rodríguez - dribeas Jan 19 '12 at 02:22
  • @David: The point he was trying to make is probably that Java leaves this to the compiler, while C++ allows you to specify every detail of the guarantees you can give to the compiler, and it can take that into account (like strict aliasing). Without such guarantees, which are part of the language and its semantics, a compiler can NOT optimize as much as it could with that information. And as long as computers are not intelligent, this will not change... – thesaint Jan 19 '12 at 02:25
  • @thesaint And wouldn't it make sense to assume that bottlenecks in a game might come while processing the enemies? Or in a different application in a piece of code that handles similar types of objects based on a configuration parameter or user input? – David Rodríguez - dribeas Jan 19 '12 at 02:29
  • @David: Ok, well, let's just say: yes, it could. But it is rather going to be a matter of convenience if you don't properly optimize the C++ code for these cases. – thesaint Jan 19 '12 at 02:37

I suspect the culprit is simple loop unrolling. Replace

var += arrPtr[x];

with

var += arrPtr[x++];
var += arrPtr[x++];
var += arrPtr[x++];
var += arrPtr[x];

and observe how much faster the C++ version runs.

Kyle Jones
  • If you look at the assembly output, you will see that this is already done, even though it is very obscure... – thesaint Jan 19 '12 at 02:00

I see you are running the following loop

for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

twice in the Java code, but only once in the C++ code; this might warm up the caches, which would make the Java code execute faster than the C++.

sramij
  • Running it twice is to get it hit by the JIT. Normally you would want to warm up the cache in C too, but in this case the difference is far too large to be explained by cache alone. – Dietrich Epp Jan 19 '12 at 05:10
  • Cache misses are not the problem; I ran this through VTune Analyzer. Additionally, the for loop takes much too long for a little cache warmup to explain these results. And as you can see now, it doesn't. This is, plain and simple, a compiler failure! – thesaint Jan 19 '12 at 09:42