2

I was testing some program and I came upon a rather unexpected anomaly.

I wrote a simple program that computed prime numbers, and used pthreads API to parallelize this workload.

After conducting some tests, I found that if i used uint64_t as the datatype for calculations and loops, the program took significantly more time to run than if i used uint32_t.

Here is the code that I ran:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

#define UINT uint64_t
#define SIZE (1024 * 1024)

typedef struct _data
{
    UINT start;
    UINT len;
    int t;
    UINT c;
}data;

int isprime(UINT x)
{
    uint8_t flag = 1;

    if(x < 2)
        return 0;

    for(UINT i = 2;i < x/2; i++)
    {
        if(!(x % i ))
        {
            flag = 0;
            break;
        }
    }

    return flag;
}

void* calc(void *p)
{
    data *a = (data*)p;

    //printf("thread no. %d has start: %lu length: %lu\n",a->t,a->start,a->len);

    for(UINT i = a->start; i < a->len; i++)
    {
        if(isprime(i))
            a->c++;
    }

    //printf("thread no. %d found %lu primes\n", a->t,a->c);
    
    pthread_exit(NULL);
}

int main(int argc,char **argv)
{   
    pthread_t *t;
    data *a;
    uint32_t THREAD_COUNT;

    if(argc < 2)
        THREAD_COUNT = 1;
    else
        sscanf(argv[1],"%u",&THREAD_COUNT);
    
    t = (pthread_t*)malloc(THREAD_COUNT * sizeof(pthread_t));
    a = (data*)malloc(THREAD_COUNT * sizeof(data));

    printf("executing the application on %u thread(s).\n",THREAD_COUNT);

    for(uint8_t i = 0; i < THREAD_COUNT; i++)
    {
        a[i].t = i;
        a[i].start = i * (SIZE / THREAD_COUNT);
        a[i].len = a[i].start + (SIZE / THREAD_COUNT);
        a[i].c = 0;
    }
    
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_create(&t[i],NULL,calc,(void*)&a[i]);
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_join(t[i],NULL);

    free(a);
    free(t);
    
    return 0;
}

I changed the UINT macro between uint32_t and uint64_t and compiled and ran the program and determined its runtime using time command on linux.

I found major difference between the runtime for uint64_t vs uint32_t.

On using uint32_t it took the program 46s to run while using uint64_t it took 2m49s to run!

I wrote a blog post about it here : https://qcentlabs.com/index.php/2021/02/01/intelx86_64-64-bit-vs-32-bit-arithmetic-big-performance-difference/

You can check out the post if you want more information.

What might be the issue behind this? Are 64 bit arithmetic slower on x86_64 than 32 bit one?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Does the issue go away if you pad `struct _data` so that it's 64 bytes in size? A smaller size struct means more structs on the same 64-byte cache line, which will hurt performance. – Mikel Rychliski Feb 01 '21 at 19:57
  • @MikelRychliski In case of uint32_t the size of stuct is 16 bytes and in case of uint64_t size of struct is 32 bytes, as per your recomendation, I pad both of them so that they be 64 bytes. After running the test again no difference was observed, with uint64_t it took 2m49s on single thread and uint32_t took 46s on single thread. – Ankit Sharma Feb 01 '21 at 20:29
  • 2
    See https://uops.info/table.html?search=div&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_base=on which confirms that 64-bit divide (invoked by your `x % i`) is about 3-4 times more expensive than 32-bit. So that seems consistent with what you observe. – Nate Eldredge Feb 01 '21 at 23:31
  • 1
    duplicates: [Performance 32 bit vs. 64 bit arithmetic](https://stackoverflow.com/q/8948918/995714), [what is the performance impact of using int64_t instead of int32_t on 32-bit systems?](https://stackoverflow.com/q/16841382/995714), [Does x86-64 support one of 16- 32- or 64-bit better than the others?](https://stackoverflow.com/q/21918547/995714) – phuclv Feb 02 '21 at 00:46
  • As a slight aside, for loops doing array indexing like in your main function, for best performance you generally want to make your loop counter of type `size_t`. – janneb Feb 02 '21 at 07:14

1 Answers1

3

In general 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and BW (bandwidth), and on x86-64 addressing the full 64-bit registers requires longer instructions.

However, you have managed to hit one of the few exceptions to this rule, namely the div instruction for calculating divisions.

Jason Sparc
  • 702
  • 2
  • 7
  • 29
janneb
  • 36,249
  • 2
  • 81
  • 97
  • 2
    Fun fact: this is an Intel effect (on CPUs before IceLake). On AMD, `div r64` and `div r32` have (almost?) the same performance if the input numbers are the same. (see linked duplicates) – Peter Cordes Feb 02 '21 at 03:21