
I've found rather poor performance running some computational code under Ubuntu on a brand new headless workstation that I'm using for scientific computation. I noticed a difference in speed running some slightly complex code on Ubuntu versus on my old Mac laptop, which I use for development. However, I've managed to distill it down to an incredibly simple example which still exhibits less than stellar improvement over my old machine:

#include <stdio.h>
#include <math.h>

int main() {
        double res = 0.0;
        for(int i=1; i<200000000; i++) {
                res += exp((double) 100.0/i);
        }
        printf("%lf", res);
        return(0);
}

Now, the Mac is a nearly 5-year-old 2.4GHz Core 2 Duo MacBook Pro running OS X 10.5, which runs this code in about 6.8 secs. However, on a brand new 3.4GHz Core i7 Dell running Ubuntu 11.10 it takes about 6.1 secs! Can someone enlighten me as to what is going on here? It seems absurd that a nearly 5-year-old laptop is within 10% of a brand new desktop workstation. It is even more absurd because I can see the Core i7 turbo-boosting to nearly 4GHz with monitoring tools!

Mac compiled with:

gcc -o test test.c -std=gnu99 -arch x86_64 -O2

Ubuntu compiled with:

gcc -o test test.c -std=gnu99 -m64 -O2 -lm

Thanks,

Louis

  • Without the generated assembler code this is useless. The compilers might output different code (because of different library implementations). It would be much better to have an assembler-level benchmark to guarantee the run time. – Nobody moving away from SE Feb 23 '12 at 17:08
  • I understand what you're saying, but I'm not going to be able to code my scientific application in assembler. I don't doubt that the raw hardware is faster: my issue is that the new workstation performs poorly with compiled C code as shown and I'd like assistance understanding how this can come about. In other words: what do I have to do to get the new workstation to post performance numbers more in line with the 5 years of technological evolution that has passed between the Core 2 Duo and Core i7? – L Aslett Feb 23 '12 at 17:27
  • @user1055918 The compiler produces the assembly we're after -- we're not asking you to write assembly. OS X was late to Intel CPUs -- they do a lot of things differently (e.g. they can assume certain instruction sets exist). As well, the libraries may be different, or they may behave (slightly) differently. – justin Feb 23 '12 at 17:43
  • @Justin Thanks -- yes, it's looking like the entire implementation is different on each platform, though I'd assumed Apple would use the GNU maths libraries, OS X being Unix-based -- obviously not! I get numerically identical answers though, so I guess the question is how I go about getting a maths library which helps my Ubuntu box live up to its potential? – L Aslett Feb 23 '12 at 18:22
  • Just another remark: your program also has an unspecified result, since you are not initializing `res`. So the compiler is basically allowed to skip your `for` loop and output whatever it likes. Maybe the gcc version on OS X does that? – Jens Gustedt Feb 23 '12 at 18:29
  • So please give us the assembler; the option is `-S`. – Jens Gustedt Feb 23 '12 at 18:30
  • On OS X, you're likely using SSE/SIMD; that may not be the case on Linux. If you're using the FPU on Linux, you may get a slightly different result, and it could take much longer to calculate. The assembly would really help. If it's resorting to a library rather than an intrinsic, then the library may have a lot of work to get an ideal result, whereas an intrinsic is suitable for most cases. – justin Feb 23 '12 at 18:31
  • specifically, time/accuracy of many transcendentals is not linear to the number of bits -- many take more insns to calculate as the bit counts increase (e.g. double may take more than twice as long as float). SSE will operate on the value at 64 bits, potentially using multiple highly optimized instructions. If done in the FPU... that really could take twice as long. – justin Feb 23 '12 at 18:36
  • @JensGustedt Oh dear /blush! Sorry ... fixed it to initialise res to 0.0, same timings. I've uploaded the assembler for the mac and linux versions to here: [test_mac.asm](http://www.louisaslett.com/personal/test_mac.asm) and [test_ubuntu.asm](http://www.louisaslett.com/personal/test_ubuntu.asm) Thanks for looking into this! – L Aslett Feb 23 '12 at 18:49
  • Well, the outline of both looks very similar. The difference is that the "ubuntu" version is using vector instructions for the SSE unit, e.g. `vaddsd`, and the OS X one is using `addsd`. But I'd guess the difference is due more to the compiler version than the OS, unless the difference is in the call to `exp`. What versions are these? – Jens Gustedt Feb 23 '12 at 19:28
  • @JensGustedt: `vaddsd` and `addsd` are both scalar operations. `vaddsd` is the AVX equivalent of the SSE `addsd` instruction. – Stephen Canon Feb 23 '12 at 19:52
  • Right -- they both just call into the external C `exp` functions, where most of the time is spent. – justin Feb 23 '12 at 20:05
  • @JensGustedt re the versions question: On Ubuntu ldd shows it is using `/lib/x86_64-linux-gnu/libm.so.6` which is in the [libc6 package](http://packages.ubuntu.com/oneiric/libc6), whilst on the Mac otool -L shows it is using `/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.7)` (I think I'm right in thinking libm is inside libSystem on the Mac?) – L Aslett Feb 23 '12 at 20:22
  • @user1055918: yes, libm is a sub-library of libSystem on OS X. – Stephen Canon Feb 23 '12 at 20:57

6 Answers

3

it is absurd that a nearly 5 year old laptop is within 10% of a brand new desktop workstation

Bear in mind that you are benchmarking one specific function (exp). We don't really know whether the two implementations of the exp() function that you're benchmarking are identical (it is not inconceivable that one is better optimized than the other).

If you were to benchmark a different function, the results could be quite different (perhaps more in line with your expectations; or not).

If exp() is really the bottleneck of your actual application, one possibility is to look into using a fast approximation. Here is a paper that offers one such approximation: A Fast, Compact Approximation of the Exponential Function.
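
For illustration, a minimal sketch along the lines of that paper (Schraudolph's trick) looks like this. It builds an approximate IEEE-754 bit pattern for e^y directly, trading accuracy (relative error on the order of a few percent) for speed. The name fast_exp and the exact constants here are just for this example, not something from a library; check the paper before relying on it:

#include <stdint.h>
#include <string.h>

/* Very rough exp() approximation (Schraudolph-style): write an approximate
   IEEE-754 bit pattern for e^y straight into the high 32 bits of a double.
   Accuracy is only a few percent, so it is not a drop-in replacement. */
static inline double fast_exp(double y) {
        /* 1512775.395... = 2^20 / ln(2); 1072632447 = 1023*2^20 - 60801
           (the exponent bias plus an offset that roughly centres the error) */
        int64_t hi = (int64_t)(1512775.3951951856 * y + 1072632447.0);
        int64_t bits = hi << 32;      /* low 32 bits of the double left as 0 */
        double d;
        memcpy(&d, &bits, sizeof d);  /* reinterpret the bits as a double */
        return d;
}

In the benchmark loop one would then call fast_exp(100.0/i) instead of exp(100.0/i); whether the loss of accuracy is acceptable depends entirely on the application.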

NPE
  • It's also very possible that one implementation of `exp` is completely **wrong**, i.e. gives inaccurate results for many inputs. Bad math libraries are more common than you think, especially with the tension between people who want correct results for scientific computing and the gamer kiddies who want their 3D games to run as fast as possible and don't care if the display or physics are slightly to moderately wrong. – R.. GitHub STOP HELPING ICE Feb 23 '12 at 17:58
  • @aix Thanks, yes, I guess I had assumed Apple were using the standard GNU maths library, OS X being Unix-based, but it seems probably not. However, I am quite surprised that such a mature function as exp() would have such significantly poorer performance under Linux. I get numerically identical answers, so I don't think the Mac implementation is sacrificing accuracy. – L Aslett Feb 23 '12 at 18:19
  • @user1055918: OS/X has to support a very limited range of CPUs, whereas Linux has to support a much wider range. One possibility is that on OS/X, `libm.a` is compiled to make use of some recent hardware features which the Linux version can't use while remaining widely portable. – NPE Feb 23 '12 at 18:22
  • @aix: In this case, the OS X library can't be taking advantage of hardware features that are *too* recent, since the CPU in question is 5 years old. IIRC, x86_64 implies SSE2, so the only feature that could possibly be used beyond the baseline is [S]SSE3, which is not really helpful in implementing exp( ). – Stephen Canon Feb 23 '12 at 19:09
  • @R..: interestingly, in my career as a library developer, the scientific computing crowd have been some of the very worst offenders in requesting that corners be cut for improved performance. As a former mathematical computing guy, I was totally shocked by this. (Not that the game devs are a lot better, mind). – Stephen Canon Feb 23 '12 at 19:20
  • @R..: I find there's a certain (small) group of scientific computing guys who care tremendously about accuracy right up to the point of whatever accuracy the numerical method *they* are using requires. Beyond that point, they don't care in the slightest, and that's just your tough luck if whatever computation *you're* doing requires better results. – Stephen Canon Feb 23 '12 at 19:50
1

As others noted, you're simply benchmarking one math library implementation of exp( ) against another. If you need high-quality math libraries on Linux, I would suggest looking at Intel's compiler tools (which come with an excellent set of libraries); they are also available for OS X and Windows.
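
For what it's worth, a typical ICC invocation for the test program would be along these lines (an illustration only; ICC normally links its own optimised math library, libimf, in place of libm):

icc -std=c99 -O2 -o test test.c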

Stephen Canon
  • Yes, I get the impression that there's a convergence of opinion here that it is down to the system libraries. This is something of a surprise to me, as I naively thought that something like calculating exp() would be optimised into oblivion by default. It is just one bottleneck which I was able to pin down, but I'm sure my real code will have others. Are there any free maths libraries (à la ATLAS in the BLAS space) that speed up maths operations? I don't think I can stretch to expensive (?) Intel compilers! – L Aslett Feb 23 '12 at 20:40
  • @user1055918: Not off the top of my head; the ICC libraries are the only ones on Linux I have any personal experience with and would feel comfortable recommending. Hopefully someone else can point you in a good direction. – Stephen Canon Feb 23 '12 at 21:03
1

Try turning on the -ffast-math option. This might give you a much less pedantically correct implementation of exp(). The question then is whether you want the potentially wrong answers that it can produce.
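
For example, building on the compile line from the question (untested on the OP's machine):

gcc -o test test.c -std=gnu99 -m64 -O2 -ffast-math -lm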

Phil Miller
0

You are comparing apples and oranges: for the Mac you allow architecture-specific optimizations, which you don't for Ubuntu. Use -O3 -march=native on both to have a fair comparison.
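
On the Ubuntu side that would be, for example:

gcc -o test test.c -std=gnu99 -O3 -march=native -lm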

Jens Gustedt
  • Out of interest I've tried these options on my Ubuntu box, and they made no perceptible difference to performance. – NPE Feb 23 '12 at 18:03
  • Thanks Jens -- I tried that and also tried -march=corei7, but no difference in speed, so I think the issue is from elsewhere in this instance. – L Aslett Feb 23 '12 at 18:20
0

A few things to try:

  • Make sure your CPU is set to run at a fixed, full speed during the experiment. It may be scaling up and down, which adds a lot of overhead.
  • Pin the test program to one core using taskset, so that the OS scheduler doesn't migrate it around (see the example commands after this list).
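
For example (a sketch assuming the cpufrequtils and util-linux tools are installed; exact tool and package names vary by distribution):

sudo cpufreq-set -c 0 -g performance    # hold core 0's governor at full speed
taskset -c 0 ./test                     # run the benchmark pinned to core 0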
Phil Miller
  • Thanks, didn't know about `taskset` ... I just learned a new Linux tool. Unfortunately it doesn't make any difference to the timing though. I used cpufreq-set to change the governor to performance and there isn't really a statistically noticeable change. Nice ideas to try though, thanks. – L Aslett Feb 23 '12 at 20:32
0

The difference in the number of CPU cycles is just 30%. Given that we don't know exactly what code the compiler generated, I would not say it's absurd. Most of the performance gain with your new CPU comes from the number of cores, and your code does not make use of that.

It may also be interesting to try and unroll the loop. The speed ratio may change.

int main() {
    double res0 = 0.0;        
    double res1 = 0.0;        
    double res2 = 0.0;        
    double res3 = 0.0;        
    double res4 = 0.0;        
    for(int i=1; i<200000000; i+=5) {
            res0 += exp((double) 100.0/i);
            res1 += exp((double) 100.0/(i+1));
            res2 += exp((double) 100.0/(i+2));
            res3 += exp((double) 100.0/(i+3));
            res4 += exp((double) 100.0/(i+4));
    }
    double res=res0+res1+res2+res3+res4;
    printf("%lf", res);
    return(0);
}
Johan Lundberg
  • Your "unrolled" loop actually changes the behavior due to the fact that addition is not associative. – R.. GitHub STOP HELPING ICE Feb 23 '12 at 19:34
  • If you'd like to try things like this, you'd be better off moving directly to OpenMP; gcc implements that well. I just tried and it gives me a speedup of 3.29. – Jens Gustedt Feb 23 '12 at 19:40 (see the OpenMP sketch after these comments)
  • Thanks. In reality though, the example I've provided is contrived simply to show something that (in my limited experience) I found surprising: that I'm having trouble getting any bang for buck out of the new workstation, which was supposed to speed up my work by more than 10%! The work I'm doing is not easily parallelisable, hence going for the fastest Core i7 possible rather than, say, a lower-clocked dual 6-core Xeon. I guess @StephenCanon is indicating that my hope of getting more than a cycle-for-cycle improvement, or at least more than a 10% improvement, was not unrealistic? – L Aslett Feb 23 '12 at 20:29
  • @R.., sure that's true. But since they are there only as an example I think we can live with that. – Johan Lundberg Feb 23 '12 at 21:27
  • @user1055918: I think you'd get a lot more bang for your buck by figuring out your precision requirements and writing or adopting your own `exp` function optimized for your particular usage case instead of buying new machines. – R.. GitHub STOP HELPING ICE Feb 23 '12 at 21:48
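
For reference, a minimal OpenMP take on the original loop, roughly what Jens Gustedt's comment above is alluding to (a sketch guessing at his setup, not his actual code; compile with gcc -std=gnu99 -O2 -fopenmp -lm):

#include <stdio.h>
#include <math.h>

int main() {
        double res = 0.0;
        /* split the iterations across cores; reduction(+:res) gives each
           thread a private partial sum and adds them together at the end */
        #pragma omp parallel for reduction(+:res)
        for(int i=1; i<200000000; i++) {
                res += exp((double) 100.0/i);
        }
        printf("%lf", res);
        return(0);
}

As with the unrolled version above, the summation order changes, so the result can differ slightly in the last digits from the serial loop.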