
EDIT: following the first answer, I removed the myexp() function, since it was buggy and not the main point of the discussion.

I have one simple piece of code that I compiled for different platforms, and I get very different performance results (execution times):

  • Java 8 / Linux: 3.5 seconds

    Execution Command: java -server Test

  • C++ / gcc 4.8.3: 6.22 seconds

    Compilation options: -O3

  • C++ / Visual Studio 2015: 1.7 seconds

    Compiler Options: /Og /Ob2 /Oi

It seems that VS has additional options that are not available for the g++ compiler.

My question is: why is Visual Studio (with those compiler options) so much faster than both Java and g++-compiled C++ (with -O3, which I believe is the highest optimization level)?

Below you can find both Java and C++ code.

C++ Code:

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <cmath>


static unsigned int g_seed;

//Used to seed the generator.
inline void fast_srand( int seed )
{
    g_seed = seed;
}

//fastrand routine returns one integer, similar output value range as C lib.
inline int fastrand()
{
    g_seed = ( 214013 * g_seed + 2531011 );
    return ( g_seed >> 16 ) & 0x7FFF;
}

int main()
{
    static const int NUM_RESULTS = 10000;
    static const int NUM_INPUTS  = 10000;

    double dInput[NUM_INPUTS];
    double dRes[NUM_RESULTS];

    fast_srand(10);

    clock_t begin = clock();

    for ( int i = 0; i < NUM_RESULTS; i++ )
    {
        dRes[i] = 0;

        for ( int j = 0; j < NUM_INPUTS; j++ )
        {
           dInput[j] = fastrand() * 1000;
           dInput[j] = log10( dInput[j] );
           dRes[i] += dInput[j];
        }
     }


    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

    printf( "Total execution time: %f sec - %f\n", elapsed_secs, dRes[0]);

    return 0;
}

Java Code:

import java.util.concurrent.TimeUnit;


public class Test
{

    static int g_seed;

    static void fast_srand( int seed )
    {
        g_seed = seed;
    }

    //fastrand routine returns one integer, similar output value range as C lib.
    static int fastrand()
    {
        g_seed = ( 214013 * g_seed + 2531011 );
        return ( g_seed >> 16 ) & 0x7FFF;
    }


    public static void main(String[] args)
    {
        final int NUM_RESULTS = 10000;
        final int NUM_INPUTS  = 10000;


        double[] dRes = new double[NUM_RESULTS];
        double[] dInput = new double[NUM_INPUTS];


        fast_srand(10);

        long nStartTime = System.nanoTime();

        for ( int i = 0; i < NUM_RESULTS; i++ )
        {
            dRes[i] = 0;

            for ( int j = 0; j < NUM_INPUTS; j++ )
            {
               dInput[j] = fastrand() * 1000;
               dInput[j] = Math.log10( dInput[j] ); // log10, to match the C++ version
               dRes[i] += dInput[j];
            }
        }

        long nDifference = System.nanoTime() - nStartTime;

        System.out.printf( "Total execution time: %f sec - %f\n", TimeUnit.NANOSECONDS.toMillis(nDifference) / 1000.0, dRes[0]);
    }
}
Sandro
    How did you test your performance? How many loops did you test? Did you do a warm-up? Java is a VM-based, non-native language, and as such there is overhead for the JVM loading and some time is required for the optimizer to kick in. Did you take that into account? – RealSkeptic Dec 14 '16 at 10:29
  • The main point is not Java warm-up (consider that the loop is executed 100 million times, and I also executed the same code multiple times without any difference in the result). Java and C++ are already comparable on Linux. I was wondering why VS is so much faster – Sandro Dec 14 '16 at 10:45
  • I assume that you have different hardware on the linux system and the windows system? Could you run the Java program on the Windows system, just for comparison's sake? – Rene Dec 14 '16 at 10:52
  • Actually, the Windows system is my machine, while Linux is a 16-core server. I already tried on my machine with the same result. Moreover, I also profiled the Java application and tried different performance JVM arguments (code cache, heap size, compile threshold) – Sandro Dec 14 '16 at 10:56
  • Your benchmark is bogus as a very smart compiler, such as modern C++ compiler, can, to a various degree, detect that the input is static and either prove that the output does not depend on input and thus allowed to optimize out everything, leaving only single `rand()` that influences print `dRes[0]`, or to calculate output on compile time. For proper measurements, you need to pass the arguments to your program on runtime. – Ivan Aksamentov - Drop Dec 14 '16 at 11:11
  • I agree, that's why I performed the test passing as the random seed a value taken at runtime (like clock()), with the same result. The point here is: if you are correct, why would Java not be able to perform such optimization? In the end, from an end-user perspective (for example my client), how can I explain that? He just sees one slower than the other. I hope you see my point. – Sandro Dec 14 '16 at 14:09
  • Regarding the speed comparison when the `exp()` is used, see the edit of my answer. Apparently both MSVC and GCC optimize by SSE2/vectorization/loop unrolling, whereas Java most likely doesn't. – EmDroid Dec 14 '16 at 14:34

1 Answer


The function

static inline double myexp( double val )
{
    const long tmp = (long)( 1512775 * val + 1072632447 );
    return double( tmp << 32 );
}

gives the warning in MSVC

warning C4293: '<<' : shift count negative or too big, undefined behavior

After changing to:

static inline double myexp(double val)
{
    const long long tmp = (long long)(1512775 * val + 1072632447);
    return double(tmp << 32);
}

the code also takes around 4 secs in MSVC.

So, apparently MSVC optimized a whole lot of stuff out there, possibly the entire myexp() function (and maybe even other code depending on its result) - because it can (remember, undefined behavior).

The lesson taken: Check (and fix) the warnings as well.


Note that if I try to print the result inside the function, the MSVC-optimized version gives me (for every call):

tmp: -2147483648
result: 0.000000

I.e., MSVC optimized the undefined behavior into always returning 0. It might also be interesting to look at the assembly output to see what else was optimized out because of this.


So, after checking the assembly, the fixed version has this code:

; 52   :             dInput[j] = myexp(dInput[j]);
; 53   :             dInput[j] = log10(dInput[j]);

    mov eax, esi
    shr eax, 16                 ; 00000010H
    and eax, 32767              ; 00007fffH
    imul    eax, eax, 1000
    movd    xmm0, eax
    cvtdq2pd xmm0, xmm0
    mulsd   xmm0, QWORD PTR __real@4137154700000000
    addsd   xmm0, QWORD PTR __real@41cff7893f800000
    call    __dtol3
    mov edx, eax
    xor ecx, ecx
    call    __ltod3
    call    __libm_sse2_log10_precise

; 54   :             dRes[i] += dInput[j];

In the original version, this entire block is missing, i.e. the call to log10() was apparently optimized out as well and replaced by a constant at the end (apparently -INF, which is the result of log10(0.0); in fact the result might also be undefined or implementation-defined). Also, the entire myexp() function was replaced by an fldz instruction (basically, "load zero"). So that explains the extra speed :)


EDIT

Regarding the performance difference when using the real exp(): The assembly output might give some clues.

In particular, for MSVC you can utilize those additional parameters:

/FAs /Qvec-report:2

/FAs produces the assembly listing (along with the source code)

/Qvec-report:2 provides useful information about the vectorization status:

test.cpp(49) : info C5002: loop not vectorized due to reason '1304'
test.cpp(45) : info C5002: loop not vectorized due to reason '1106'

The reason codes are available here: https://msdn.microsoft.com/en-us/library/jj658585.aspx - in particular, MSVC seems unable to vectorize these loops properly. But according to the assembly listing, it still uses the SSE2 functions (which is still a kind of "vectorization" and improves the speed significantly).

The similar parameters for GCC are:

-funroll-loops -ftree-vectorizer-verbose=1

Which gives the result for me:

Analyzing loop at test.cpp:42
Analyzing loop at test.cpp:46
test.cpp:30: note: vectorized 0 loops in function.
test.cpp:46: note: Unroll loop 3 times

So apparently g++ is not able to vectorize either, but it does loop unrolling (in the assembly I can see that the loop body is duplicated 3 times), which can also explain the better performance.

Unfortunately, this is where Java falls short, AFAIK: Java does not do vectorization, SSE2, or loop unrolling here, and therefore it is much slower than the optimized C++ version. See e.g. here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions? where JNI is recommended for better performance (i.e., doing the calculation in a C/C++ DLL called through the JNI interface from the Java app).

EmDroid
  • Does VS give the same warning when compiled as 64bit? – WhozCraig Dec 14 '16 at 10:53
  • Not sure, I didn't try (I tested it with VS2013). But AFAIK `long` is 4B in 64-bit on Windows as well (LLP64). – EmDroid Dec 14 '16 at 10:55
  • But actually g++ 64-bit has the `long` of 8B, so that might be the reason it works there. – EmDroid Dec 14 '16 at 11:00
  • Thanks for the reminder. I now remember yet another reason why I love my Mac =P – WhozCraig Dec 14 '16 at 11:00
  • you shouldn't blame the platform when you're using implementation-specific types (C++ standard only says `long` is *at least* 32bits, possibly more), if you wanted consistent cross-platform behavior you should use fixed-width like `int64_t`. see http://stackoverflow.com/a/13604190/1362755 – the8472 Dec 14 '16 at 12:32
  • You are correct. But now, if I perform the test replacing the myexp(double) function with the library math exp function - exp() for C++ and Math.exp() for Java - I get the following execution times: Java 40 seconds, C++ (Linux) 3 seconds, C++ (Windows) 9 seconds. How do you explain that? – Sandro Dec 14 '16 at 12:52
  • thanks a lot for the answer. So, this doesn't seem to be related to the math computation rather on the loop (vectorization) itself. Here is another reference: http://bugs.java.com/view_bug.do?bug_id=7192383 – Sandro Dec 14 '16 at 16:46
  • @axalis you shouldn't just look at the accepted answer of the linked question since it can be be outdated. hotspot c2 *does* implement loop unrolling and superword parallelism. But such optimizations are always dependent on which patterns they can find and exploit, not all compilers are equally strong there and they improve over time, i.e. jdk9 will probably vectorize more loops than jdk8. http://stackoverflow.com/a/17142855/1362755 – the8472 Dec 14 '16 at 20:27
  • Exactly, in fact I tried with that JVM flag (on/off) without any difference in execution time. I have also found this bug (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8148754). I will give it a try with JDK 9 – Sandro Dec 14 '16 at 22:13
  • I confirm with Java 9 I am gaining 34%: on my current machine I am going from 2.5 seconds (java 8) to 1.9 seconds (java 9) – Sandro Dec 14 '16 at 22:41