
I have a 64-bit Ubuntu 13.04 system. I was curious to see how 32-bit applications perform against 64-bit applications on a 64-bit system, so I compiled the benchmark program below as 32-bit and 64-bit executables and recorded the time each took to execute. I used gcc flags to build for three different targets (a quick sizeof check after this list illustrates the data models):

  • -m32: Intel 80386 architecture (int, long, pointer all set to 32 bits (ILP32))
  • -m64: AMD's x86-64 architecture (int 32 bits; long, pointer 64 bits (LP64))
  • -mx32: AMD's x86-64 architecture (int, long, pointer all set to 32 bits (ILP32), but with the CPU in long mode, sixteen 64-bit registers, and the register-call ABI)
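
As a sanity check of those data models (an aside from the benchmark; sizes.c is just an illustrative name), a tiny program can print the type widths under each flag:

// sizes.c -- hypothetical helper, separate from the benchmark: prints the
// widths of int, long and void* so each -m flag's data model can be verified.
#include <stdio.h>

int main(void) {
    printf("int: %zu, long: %zu, pointer: %zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
    return 0;
}

Built with -m32 or -mx32 this should print 4/4/4; with -m64 it should print 4/8/8.
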
// This program solves Project Euler problem 16.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <sys/time.h>

int sumdigit(int a, int b);

int main(void) {
    int a = 2;
    int b = 10000;
    struct timeval start, finish;
    unsigned int i;
    gettimeofday(&start, NULL);
    for(i = 0; i < 1000; i++)
        (void)sumdigit(a, b);
    gettimeofday(&finish, NULL);
    printf("Did %u calls in %.4g seconds\n", 
            i, 
            finish.tv_sec - start.tv_sec + 1E-6 * (finish.tv_usec - start.tv_usec));
    return 0;
}

int sumdigit(int a, int b) {
    // numlen = number of digits in a^b
    // pcount = exponent of 'a' reached so far
    // dcount = number of digits in a^pcount

    int numlen = (int) (b * log10(a)) + 1;
    char *arr = calloc(numlen, sizeof *arr);
    assert(arr != NULL);  // guard against allocation failure (uses the included assert.h)
    int pcount = 0;
    int dcount = 1;
    arr[numlen - 1] = 1;
    int i, sum, carry;

    while(pcount < b) {
        pcount += 1;

        sum = 0; 
        carry = 0;

        for(i = numlen - 1; i >= numlen - dcount; --i) {
            sum = arr[i] * a + carry;
            carry = sum / 10;
            arr[i] = sum % 10;
        }

        while(carry > 0) {
            dcount += 1;
            sum = arr[numlen - dcount] + carry;
            carry = sum / 10;
            arr[numlen - dcount] = sum % 10;
        } 
    }

    int result = 0;
    for(i = numlen - dcount; i < numlen; ++i)
        result += arr[i];

    free(arr);
    return result;
}

The commands I used to build the different executables:

gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_x32 -lm -mx32
gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_32 -lm -m32
gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_64 -lm

Here are the results I got:

ajay@ajay:c$ ./pe16_x32
Did 1000 calls in 89.19 seconds

ajay@ajay:c$ ./pe16_32
Did 1000 calls in 88.82 seconds

ajay@ajay:c$ ./pe16_64
Did 1000 calls in 92.05 seconds

Why does the 64-bit version run slower than the 32-bit one? I read that the 64-bit architecture has an improved instruction set and twice as many general-purpose registers compared to the 32-bit architecture, which allows for more optimizations. When can I expect better performance on a 64-bit system?

Edit: I turned on optimization with the -O3 flag and now the results are:

ajay@ajay:c$ ./pe16_x32
Did 1000 calls in 38.07 seconds

ajay@ajay:c$ ./pe16_32
Did 1000 calls in 38.32 seconds

ajay@ajay:c$ ./pe16_64
Did 1000 calls in 38.27 seconds
  • Some 64-bit data is bigger than 32-bit data, and this can reduce cache hits. – Barmar Feb 03 '14 at 11:07
  • Have you tried running the same program multiple times in a row? Or tried with different optimization flags? – Some programmer dude Feb 03 '14 at 11:08
  • x32 <-> 32 differ only by 0.5% and x32 <-> 64 differ only by 3%. I'm inclined to file that under noise. Can you reliably reproduce these differences? –  Feb 03 '14 at 11:10
  • @JoachimPileborg Yes I ran them multiple times but the results each time were in the same order. pe16_32 took the least and pe16_64 took the most out of the three. – ajay Feb 03 '14 at 11:11
  • Only just noticed that you use no optimizations. That's silly, because it doesn't reflect how programs whose performance matters will/should be run. –  Feb 03 '14 at 11:14
  • So the default optimizations gcc does won't matter? – ajay Feb 03 '14 at 11:15
  • Benchmarking code without optimisations is utterly pointless. Do you want the code to run fast or not? – David Heffernan Feb 03 '14 at 11:17
  • @DavidHeffernan I turned on the `-O3` flag and ran again. The results have changed dramatically: each execution time has halved, and all three now agree to the first decimal place. – ajay Feb 03 '14 at 11:23
  • @DavidHeffernan When can I expect a useful performance gain in this case and shall I remove this question? – ajay Feb 03 '14 at 11:26
  • I'd expect identical performance for 32 and 64 bit for this code. The operands are the same size in all cases, and there is no floating point. – David Heffernan Feb 03 '14 at 11:28
  • ... and there aren't a lot of locals to take advantage of having more registers. And no function calls that can't be inlined, so no advantage from the register-call ABI of x86-64 and x32. – Peter Cordes Jul 30 '15 at 16:43
  • So, in short, if sumdigit were put in a library (statically or dynamically linked), the results would revert to the original differences, because then sumdigit couldn't be inlined? – user1657170 Oct 28 '18 at 14:11

1 Answer


Comparing performance of code without optimisations is rather pointless. If you care about performance, you'll only ever use optimised code.

And when you enable optimisations, you find that the performance differences are negligible. That is to be expected: the operations you perform are all integer operations on data of the same size in all cases. Since the 32-bit and 64-bit code run on the same integer hardware units, you should expect the same performance.
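
For contrast, here is a minimal sketch (my illustration, not code from the question) of a workload where you would expect -m64 to pull ahead: each 64-bit multiply is one native instruction under -m64 but has to be synthesised from several 32-bit instructions under -m32.

// mul64.c -- illustrative sketch only: a loop dominated by 64-bit integer
// multiplies, native under -m64 but multi-instruction sequences under -m32.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t x = 0x9E3779B97F4A7C15ULL;  // arbitrary odd 64-bit constant
    uint64_t acc = 1;
    for (uint32_t i = 0; i < 100000000u; i++)
        acc = acc * x + i;               // one 64-bit multiply-add per iteration
    printf("%llu\n", (unsigned long long)acc);  // print so the loop isn't optimised away
    return 0;
}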

You are not using any floating point operations, which is one area where there are sometimes differences between 32-bit and 64-bit code due to different floating point hardware units (x64 uses SSE, x86 may use x87).
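
As an illustration (again a sketch, not the question's code), a loop like this exercises exactly that difference: under -m32 gcc defaults to the x87 unit, whose 80-bit intermediate precision can affect both rounding and speed, while -m64 uses SSE2 scalar instructions. Adding -mfpmath=sse -msse2 to the 32-bit build makes the two behave alike.

// fpsum.c -- illustrative sketch: a double-precision dependency chain where
// x87 (-m32 default) and SSE2 (-m64 default) may differ in result and speed.
#include <stdio.h>

int main(void) {
    double s = 0.0;
    for (int i = 1; i <= 10000000; i++)
        s += 1.0 / i;         // partial sum of the harmonic series
    printf("%.17g\n", s);     // enough digits to expose rounding differences
    return 0;
}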

In short, the results are exactly as expected.
