performance is degraded with -m64 flag

Question

I wrote a simple filter program to check whether compiling with -m64 improves performance over -m32.

Here is the complete code:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#include<sys/time.h>
#define __STDC_FORMAT_MACROS 1
#include<inttypes.h>

#define tap_size 5
int luma_stride=640;
int luma_ht=480;
int croma_stride=320;
int croma_ht=240;

int filter[tap_size]={-3,2,3,2,-3};


struct timeval tv1, tv2,tv3;
uint64_t  ui1;
uint64_t total_time=0;

uint64_t GetTimeStamp();
void process_frame(unsigned char *ip_buffer, unsigned char * op_buffer, int ip_buf_size, int op_buf_size);


int main()
{

    int ip_buf_size;
    int op_buf_size;


    unsigned char * ip_buffer;
    unsigned char * op_buffer;
    unsigned char * temp;



    ip_buf_size=luma_stride*luma_ht + 2*croma_stride * croma_ht;
    op_buf_size=ip_buf_size; //

    ip_buffer = (unsigned char *)malloc(ip_buf_size*sizeof(char));
    op_buffer = (unsigned char *)malloc(ip_buf_size*sizeof(char));;
    temp=ip_buffer;
    for(int i=0;i<ip_buf_size;i++)
    {
        *temp++=rand(); // advance the pointer so every byte is filled, not just the first
    }

    for(int i=0;i<100;i++)
    {
        ui1=GetTimeStamp();
        process_frame(ip_buffer, op_buffer, ip_buf_size, op_buf_size);//process
        total_time+=GetTimeStamp()-ui1;
    }
    free(ip_buffer);
    free(op_buffer);
    printf("\nTotal time=%" PRIu64 " us\n", total_time);
    return 0;
}



uint64_t GetTimeStamp()
{
    struct timeval tv;
    gettimeofday(&tv,NULL);
    return tv.tv_sec*(uint64_t)1000000+tv.tv_usec;
}


void process_frame(unsigned char *ip_buffer, unsigned char * op_buffer, int ip_buf_size, int op_buf_size)
{

    int i,j;
    unsigned char *ptr1,*ptr2;
    unsigned char *temp_buffer=(unsigned char *) malloc(op_buf_size*sizeof(unsigned char));

    ptr1=ip_buffer;
    //ptr2=temp_buffer;
    ptr2=op_buffer;


    //Vertical filter

    //Luma
    /*  for(j=0;j<tap_size/2;j++)
     {
     for(i=0;i<luma_stride;i++)
     {
     *ptr2++=*ptr1++;
     }
     } */

    memcpy(ptr2,ptr1,2*luma_stride*sizeof(unsigned char));
    ptr1=ip_buffer+2*luma_stride;
    ptr2=op_buffer+2*luma_stride;

    for(i=0;i<luma_ht-tap_size+1;i++)
    {

        for(j=0;j<luma_stride;j++)
        {
            int k;
            long int temp=0;
            for(k=0;k<tap_size;k++)
            {
                temp+=filter[k]**(ptr1+(k-tap_size/2)*luma_stride);
            }
            //temp=temp>>4;
            if(temp>255) temp =255;
            else if(temp<0) temp=0;
            *ptr2=temp;
            ++ptr1;
            ++ptr2;
        }

    }

    memcpy(ptr2,ptr1,2*luma_stride*sizeof(unsigned char));
    ptr1=ptr1+2*luma_stride;
    ptr2=ptr2+2*luma_stride;

    //Copy croma values as it is!
    for(i=luma_ht*luma_stride;i<ip_buf_size;i++)
    {
        op_buffer[i]=ip_buffer[i];
    }
}

I compiled it with these two options

g++ -O3 program.c -o filter64 -m64

and

g++ -O3 program.c -o filter32 -m32

The output of ./filter32 is

Total time=106807 us

and that of ./filter64 is

Total time=140699 us

Shouldn't it be the other way around? That is, shouldn't filter64 take less time than filter32, since the 64-bit architecture provides more registers? How can I achieve that? Is there a compiler option that takes care of this? Please help.

I am using Ubuntu on a 64-bit Intel machine.

Are the results the same if you measure the time only once, before and after the for loop (rather than 100 times inside it)? Are they the same if you increase the number of repeats from 100 to, for example, 100000? – Zuljin, Feb 02 '15 at 10:52
Code that uses pointers extensively may run slower in 64-bit mode, since fewer pointers now fit in the cache. You should also inline `GetTimeStamp` and use a higher-precision timer if possible. Another option is [`-mx32`](https://sites.google.com/site/x32abi/) – phuclv, Feb 02 '15 at 11:06

Paul R · Accepted Answer · 2015-02-02T11:01:25.387

2

There are various trade-offs when you switch from 32-bit to 64-bit. On the downside, all pointers become twice the size, and it can take a longer instruction sequence to load an immediate address into a register. Unless your application is register-starved or needs more than 4 GB of address space, you might want to keep it 32-bit.

Also note that your timing method is somewhat suspect - you may just be seeing the effect of page faults etc - you should put your test code in a loop, with memory allocation outside the loop and the processing code inside the loop. Ignore the first iteration for timing purposes. That way all memory is wired and the caches are warmed before you start timing.

One further problem: you seem to have a memory leak in process_frame, which as well as being a bug may also make the timing unreliable.

edited Feb 02 '15 at 11:01

answered Feb 02 '15 at 10:44


Changed the code to accommodate all your suggestions, but no luck :( `process_frame(ip_buffer, op_buffer, ip_buf_size, op_buf_size);//process ui1=GetTimeStamp(); for(i=0;i<100;i++) { process_frame(ip_buffer, op_buffer, ip_buf_size, op_buf_size);//process } total_time+=GetTimeStamp()-ui1;` – Vikram Dattu Feb 02 '15 at 10:53
Did you move the `malloc`s/`free`s out of `process_frame()` or are they still in the loop ? Actually I see you're leaking memory inside `process_frame()` now, so that's not going to help... – Paul R Feb 02 '15 at 10:56
malloc and free are outside the loop in main function. So there isn't any chance of memory leak. – Vikram Dattu Feb 02 '15 at 11:00
Look more carefully - you are `malloc`-ing a temporary buffer (`temp_buffer`) inside `process_frame()` which you never `free` (actually it looks like you never even use it ?). – Paul R Feb 02 '15 at 11:02

score 1 · Answer 2 · edited May 23 '17 at 10:24


Why are you using a C++ compiler to compile C? It makes your C code worse, since you must do horrible things like having to cast the return value of malloc() and such.

Also, are you certain that your program's performance is limited by available registers? You need to profile your program to figure out exactly where time is spent, to figure out if/how moving to 64-bit can make it faster.

It's not as simple as "all code is faster when built for 64-bit, because there are more registers".


answered Feb 02 '15 at 10:43

unwind


Compiling with a C compiler (gcc) gives similar results. I expected that there would at least be no degradation when switching from 32-bit to 64-bit. I will profile as you suggested and let you know. – Vikram Dattu Feb 02 '15 at 10:58
