5

I'm optimizing the function, I try every way and even sse, and modified code to return from different position to see the calculate timespan but finally I found most of the time spends on the bool judgement. Even I replace all code in the if statement with a simple add operation in it, it still cost 6000ms.

My platform is gcc 4.7.1 e5506 cpu. Its input 'a' and 'b' is a 1000size int array, and 'asize', 'bsize' are corresponding array size. MATCH_MASK = 16383, I run the function 100000 times to statistics a timespan. Is there any good idea to the problem. Thank you!

   if (aoffsets[i] && boffsets[i])  // this line costs most time

Code:

uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
uint8_t* seen = (uint8_t *)aoffsets;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
    for (int i = 0; i < n_size; ++i)
        offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);

uint8_t topcount = 0;
int topoffset = 0;
{
    std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   // it's the fastest way already, very near to tls
    uint8_t* counts = &(count_vec[0]);
            //return aoffsets[0];   // cost 1375 ms
    for (int i = 0; i < MATCH_MASK; ++i)
    {
        if (aoffsets[i] && boffsets[i])  // this line costs most time
        {
                            //++affsets[i];  // for test
            int offset = (aoffsets[i] -= boffsets[i]);
            if ((-n_maxoffset <= offset && offset <= n_maxoffset))
            {
                offset += bsize;
                uint8_t n_cur_count = ++counts[offset];
                if (n_cur_count > topcount)
                {
                    topcount = n_cur_count;
                    topoffset = offset;
                }
            }
        }
    }
}
    return aoffsets[0];   // cost 6000ms
Joseph Quinsey
  • 9,553
  • 10
  • 54
  • 77
magicyang
  • 453
  • 1
  • 5
  • 14
  • Is this function threaded, could you have a thread waiting? – Fantastic Mr Fox Nov 22 '12 at 04:25
  • 1
    Sounds like branch prediction failure –  Nov 22 '12 at 04:35
  • Could you post the assembly that is generated? (gcc -S -masm=intel) – Alex I Nov 22 '12 at 04:42
  • 1
    how are you benchmarking it? – Cory Nelson Nov 22 '12 at 05:09
  • If you want a code review, please use http://codereview.stackexchange.com/. – Some programmer dude Nov 22 '12 at 06:06
  • 1
    Also see this epic question http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array . You might wan't to try to split `aoffsets[DOUBLE_MATCH_MASK]` into two arrays `MATCH_MASK` length each and sort them before processing. This could dramatically improve branch prediction –  Nov 22 '12 at 07:18
  • What optimisation switches are you using ? – Paul R Nov 22 '12 at 07:45
  • @aleguna I try your code but its result is wrong, it's different from the result of if(aoffsets[i] && boffsets[i]). In gcc your code is very fast and in vs2012 it's as slow as previous if statement.. – magicyang Nov 22 '12 at 08:06
  • What are you trying to do here -- what's the definition of MATCH_STRIP() ? – Christoffer Nov 22 '12 at 08:10
  • @magicyang, are you talking about trick with shifts I posted earlier and removed? I didn't try to compile/run it, but it should work... All it does is combines two 16bit unsigned integers into a 32bit one and then checks if that is not 0 using bitwise AND. –  Nov 22 '12 at 08:14
  • @Christoffer MATCH_STRIP is a MACRO, in fact it's unimportant, for I've test it cost only ten percent of the total time, and the for loop cost most of the time – magicyang Nov 22 '12 at 08:18
  • @magicyang well, without that definition I am unable to figure out just what your code is supposed to do, and thus unable to suggest a more effective solution. Your loss :-) – Christoffer Nov 22 '12 at 08:21
  • @aleguna ah..I've thought you are using some trick I can't understand.. Your thought is wrong from begin. The old judgement is to ensure none of aoffsets[i] and boffsets[i] is zero, but your edition is to ensure aoffsets[i] and boffsets[i] is not zero at the same time, if one of them is zero, your judge is error. – magicyang Nov 22 '12 at 08:27
  • @Christoffer static const int MATCH_BITS = 14; static const int MATCH_LEFT = (32 - MATCH_BITS); #define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT) – magicyang Nov 22 '12 at 08:29
  • 1
    why exactly was this question closed? It is an example of finding an interesting effect of underlying hardware on performance of a program? – Philipp Nov 22 '12 at 09:37
  • @Ben it's a function in single thread, and it will be invoked by multiple thread, so now I want increase it's single threaded performance at first :) – magicyang Nov 22 '12 at 09:42
  • **What optimisation switches are you using ?** – Paul R Nov 27 '12 at 09:12
  • use both of the two answers – magicyang Dec 28 '12 at 05:58

2 Answers2

4

First, memset(count_vec,0, N); of a memaligned buffer wins over std::vector by 30%.

You can try to use the branchless expression (aoffsets[i] * boffsets[i]) and calculate some of not-to-be-used expressions simultaneously: offset = aoffset[i]-boffset[i]; offset+bsize; offset+n_maxoffset;.

Depending on the typical range of offset, one could be tempted to calculate the min/max of (offset+bsize) to restrict the needed memset(count_vec) at the next iteration: no need to clear already zero values.

As pointed by philipp, it's good to interleave the operations -- then again, one can read both aoffset[i] and boffset[i] simultaneously from uint32_t aboffset[N]; with some clever bit masking (that generates change mask for: aoffset[i], aoffset[i+1]) one could possibly handle 2 sets in parallel using 64-bit simulated SIMD in pure c (up to the histogram accumulation part).

Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
  • expression (aoffsets[i] * boffsets[i]) is faster then (aoffsets[i] && boffsets[i]) , 8775ms vs 9800ms, about ten percent faster on gcc, but ten percent slower on vs2012.. – magicyang Nov 22 '12 at 08:42
3

You can increase the speed of your program by reducing the cache misses: aoffsets[i] and boffsets[i] are relatively far away from each other in memory. By placing them next to each other, you speed up the program significantly. On my machine (e5400 cpu, VS2012) the execution time is reduced from 3.0 seconds to 2.3 seconds:

#include <vector>

#include <windows.h> 
#include <iostream> 


typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;

#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14; 
static const int MATCH_LEFT = (32 - MATCH_BITS); 
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)

static const int n_maxoffset = 1000;

uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
    uint16_t offsets[DOUBLE_MATCH_MASK] = {0}; 

    auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
    {
        for (int i = 0; i < n_size; ++i)
            offsets[MATCH_STRIP(x[i])*2 /*important. leave space for other offsets*/] = i;
    };
    fn_init_offsets(a, asize, offsets);
    fn_init_offsets(b, bsize, offsets+1);

    uint8_t topcount = 0;
    int topoffset = 0;
    {
        std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   
        uint8_t* counts = &(count_vec[0]);
        for (int i = 0; i < MATCH_MASK; i+=2)
        {
            if (offsets[i] && offsets[i+1])  
            {
                int offset = (offsets[i] - offsets[i+1]); //NOTE: I removed 
                if ((-n_maxoffset <= offset && offset <= n_maxoffset))
                {
                    offset += bsize;
                    uint8_t n_cur_count = ++counts[offset];
                    if (n_cur_count > topcount)
                    {
                        topcount = n_cur_count;
                        topoffset = offset;
                    }
                }
            }
        }
    }
    return offsets[0];   
}


int main(int argc, char* argv[])
{
    const int sizes = 1000;
    int32_t* a = new int32_t[sizes];
    int32_t* b = new int32_t[sizes];
    for (int i=0;i<sizes;i++)
    {
        a[i] = rand()*rand();
        b[i] = rand()*rand();
    }

    //Variablen 
    LONGLONG g_Frequency, g_CurentCount, g_LastCount; 

    QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount); 

    int sum = 0;

    for (int i=0;i<100000;i++)
    {
        sum += test(a,sizes,b,sizes);
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount); 
    double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency)); 

    std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl; 


    delete[] a;
    delete[] b;
    return 0;
}

compared to your version of test().

#include <vector>

#include <windows.h> 
#include <iostream> 


typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;

#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14; 
static const int MATCH_LEFT = (32 - MATCH_BITS); 
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;

uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
    uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
    uint16_t* boffsets = aoffsets + MATCH_MASK;

    auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
    {
        for (int i = 0; i < n_size; ++i)
            offsets[MATCH_STRIP(x[i])] = i;
    };
    fn_init_offsets(a, asize, aoffsets);
    fn_init_offsets(b, bsize, boffsets);

    uint8_t topcount = 0;
    int topoffset = 0;
    {
        std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   
        uint8_t* counts = &(count_vec[0]);

        for (int i = 0; i < MATCH_MASK; ++i)
        {
            if (aoffsets[i] && boffsets[i]) 
            {

                int offset = (aoffsets[i] - boffsets[i]); //NOTE: I removed the -= because otherwise offset would always be positive!
                if ((-n_maxoffset <= offset && offset <= n_maxoffset))
                {
                    offset += bsize;
                    uint8_t n_cur_count = ++counts[offset];
                    if (n_cur_count > topcount)
                    {
                        topcount = n_cur_count;
                        topoffset = offset;
                    }
                }
            }
        }
    }
    return aoffsets[0];   
}


int main(int argc, char* argv[])
{
    const int sizes = 1000;
    int32_t* a = new int32_t[sizes];
    int32_t* b = new int32_t[sizes];
    for (int i=0;i<sizes;i++)
    {
        a[i] = rand()*rand();
        b[i] = rand()*rand();
    }

    LONGLONG g_Frequency, g_CurentCount, g_LastCount; 
    QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount); 

    int sum = 0;

    for (int i=0;i<100000;i++)
    {
        sum += test(a,sizes,b,sizes);
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount); 

    double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency)); 

    std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl; 


    delete[] a;
    delete[] b;
    return 0;
}
Philipp
  • 11,549
  • 8
  • 66
  • 126
  • Thank you very much! I use your method and reduce time from about 8800ms to 8199ms, about ten percent faster! So combining with your and the (aoffsets[i] * boffsets[i]), the function 's speed has increase about 20% – magicyang Nov 22 '12 at 10:33