Intel(x86_64) 64 bit vs 32 bit integer arithmetic performance difference

Question

I was testing some program and I came upon a rather unexpected anomaly.

I wrote a simple program that computed prime numbers, and used pthreads API to parallelize this workload.

After conducting some tests, I found that if i used uint64_t as the datatype for calculations and loops, the program took significantly more time to run than if i used uint32_t.

Here is the code that I ran:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

#define UINT uint64_t
#define SIZE (1024 * 1024)

typedef struct _data
{
    UINT start;
    UINT len;
    int t;
    UINT c;
}data;

int isprime(UINT x)
{
    uint8_t flag = 1;

    if(x < 2)
        return 0;

    for(UINT i = 2;i < x/2; i++)
    {
        if(!(x % i ))
        {
            flag = 0;
            break;
        }
    }

    return flag;
}

void* calc(void *p)
{
    data *a = (data*)p;

    //printf("thread no. %d has start: %lu length: %lu\n",a->t,a->start,a->len);

    for(UINT i = a->start; i < a->len; i++)
    {
        if(isprime(i))
            a->c++;
    }

    //printf("thread no. %d found %lu primes\n", a->t,a->c);
    
    pthread_exit(NULL);
}

int main(int argc,char **argv)
{   
    pthread_t *t;
    data *a;
    uint32_t THREAD_COUNT;

    if(argc < 2)
        THREAD_COUNT = 1;
    else
        sscanf(argv[1],"%u",&THREAD_COUNT);
    
    t = (pthread_t*)malloc(THREAD_COUNT * sizeof(pthread_t));
    a = (data*)malloc(THREAD_COUNT * sizeof(data));

    printf("executing the application on %u thread(s).\n",THREAD_COUNT);

    for(uint8_t i = 0; i < THREAD_COUNT; i++)
    {
        a[i].t = i;
        a[i].start = i * (SIZE / THREAD_COUNT);
        a[i].len = a[i].start + (SIZE / THREAD_COUNT);
        a[i].c = 0;
    }
    
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_create(&t[i],NULL,calc,(void*)&a[i]);
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_join(t[i],NULL);

    free(a);
    free(t);
    
    return 0;
}

I changed the UINT macro between uint32_t and uint64_t and compiled and ran the program and determined its runtime using time command on linux.

I found major difference between the runtime for uint64_t vs uint32_t.

On using uint32_t it took the program 46s to run while using uint64_t it took 2m49s to run!

I wrote a blog post about it here : https://qcentlabs.com/index.php/2021/02/01/intelx86_64-64-bit-vs-32-bit-arithmetic-big-performance-difference/

You can check out the post if you want more information.

What might be the issue behind this? Are 64 bit arithmetic slower on x86_64 than 32 bit one?

Does the issue go away if you pad `struct _data` so that it's 64 bytes in size? A smaller size struct means more structs on the same 64-byte cache line, which will hurt performance. — Mikel Rychliski, Feb 01 '21 at 19:57
@MikelRychliski In case of uint32_t the size of stuct is 16 bytes and in case of uint64_t size of struct is 32 bytes, as per your recomendation, I pad both of them so that they be 64 bytes. After running the test again no difference was observed, with uint64_t it took 2m49s on single thread and uint32_t took 46s on single thread. — Ankit Sharma, Feb 01 '21 at 20:29
See https://uops.info/table.html?search=div&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_base=on which confirms that 64-bit divide (invoked by your `x % i`) is about 3-4 times more expensive than 32-bit. So that seems consistent with what you observe. — Nate Eldredge, Feb 01 '21 at 23:31
duplicates: [Performance 32 bit vs. 64 bit arithmetic](https://stackoverflow.com/q/8948918/995714), [what is the performance impact of using int64_t instead of int32_t on 32-bit systems?](https://stackoverflow.com/q/16841382/995714), [Does x86-64 support one of 16- 32- or 64-bit better than the others?](https://stackoverflow.com/q/21918547/995714) — phuclv, Feb 02 '21 at 00:46
As a slight aside, for loops doing array indexing like in your main function, for best performance you generally want to make your loop counter of type `size_t`. — janneb, Feb 02 '21 at 07:14

score 3 · Accepted Answer · edited Apr 02 '22 at 20:38

3

In general 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and BW (bandwidth), and on x86-64 addressing the full 64-bit registers requires longer instructions.

However, you have managed to hit one of the few exceptions to this rule, namely the div instruction for calculating divisions.

edited Apr 02 '22 at 20:38

Jason Sparc

702
2
7
29

answered Feb 01 '21 at 21:23

janneb

36,249
2
81
97

2

Fun fact: this is an Intel effect (on CPUs before IceLake). On AMD, `div r64` and `div r32` have (almost?) the same performance if the input numbers are the same. (see linked duplicates) – Peter Cordes Feb 02 '21 at 03:21

Intel(x86_64) 64 bit vs 32 bit integer arithmetic performance difference

1 Answers1