10

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take two rows of this vector and compare them with this function:

    int getDiff(int indx1, int indx2) {
        int result = 0;
        int pplus, pminus, tmp;

        for (int k = 0; k < 128; k += 2) {
            pplus = nodeL[indx2][k] - nodeL[indx1][k];
            pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];

            tmp = max(pplus, pminus);
            if (tmp > result) {
                result = tmp;
            }
        }
        return result;
    }

As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns the maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.

Of course it is micro-optimization, but any help would be appreciated, since I know nothing about SSE. Thanks in advance

Benchmark:

    int nofTestCases = 10000000;

    vector<int> nodeIds(nofTestCases);
    vector<int> goalNodeIds(nofTestCases);
    vector<int> results(nofTestCases);

    for (int l = 0; l < nofTestCases; l++) {
        nodeIds[l] = randomNodeID(18000000);
        goalNodeIds[l] = randomNodeID(18000000);
    }



    double time, result;

    time = timestamp();
    for (int l = 0; l < nofTestCases; l++) {
        results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
    }
    result = timestamp() - time;
    cout << result / nofTestCases << "s" << endl;

    time = timestamp();
    for (int l = 0; l < nofTestCases; l++) {
        results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
    }
    result = timestamp() - time;
    cout << result / nofTestCases << "s" << endl;

where

    int randomNodeID(int n) {
        return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
    }

    /** Returns a timestamp ('now') in seconds (incl. a fractional part). */
    inline double timestamp() {
        struct timeval tp;
        gettimeofday(&tp, NULL);
        return double(tp.tv_sec) + tp.tv_usec / 1000000.;
    }
Alexandros
  • Each time you call the [] operator, it will re-lookup the element. This gets worse for arrays of arrays (you didn't explicitly state the type of your "vector"). You can cache the element nodeL[indx2] in a local variable, for example, and use this in your for loop. That should boost speed already. – Neil Kirk Jul 22 '13 at 15:50
  • 1
    Have you checked that the compiler isn't already doing SSE operations? – Mats Petersson Jul 22 '13 at 15:50
  • Well it depends on the types. But double-dereferencing a std::vector of std::vectors can be costly. It might not matter most of the time, but it surely will millions of times. – Neil Kirk Jul 22 '13 at 15:53
  • @AlexandroE.: What pattern is `indx1` and `indx2` - is that just a for-loop inside another for loop doing 18M in each? Or is there some other pattern to it? – Mats Petersson Jul 22 '13 at 15:55
  • @Mats indx1 and indx2 are random. They are IDs from a graph and therefore they may be anything. – Alexandros Jul 22 '13 at 15:56
  • @Mats how do I check for SSE operations in the compiler. I have not done it before – Alexandros Jul 22 '13 at 15:57
  • @AlexandrosE. To check for SSE: Run the compiler with `assembler output` and see what it generates, pretty much. – Mats Petersson Jul 22 '13 at 16:06
  • @AlexandrosE. Does it always cover every value of the 18M, or is your code only working over a small number of those values? – Mats Petersson Jul 22 '13 at 16:06
  • @MatsPetersson. It might not explore the whole graph in one batch but running a 1000 experiments will probably use all the available 18000000 IDs. – Alexandros Jul 22 '13 at 16:09
  • You might want to check out the commercial Intel Performance Primitives Library or the free Framewave library. They allow you to take advantage of SSE from a C interface. However I see a potential problem: You are treating every second element in your vector differently. To really take advantage of SSE the same operation should be performed on neighboring elements. – Compuholic Jul 22 '13 at 16:15
  • 1
    @AlexandrosE. Yes, but touching all 18M in one long sequence is quite different from touching 18000 of them "at random", from a caching perspective, for example. – Mats Petersson Jul 22 '13 at 16:24
  • And using `g++ -O3` and GCC 4.6.3 on Fedora gives SSE instructions. Your mileage may vary on other compilers. – Mats Petersson Jul 22 '13 at 16:25
  • 1
    In modern programs memory is usually more of a bottleneck than CPU, especially for large datasets. You might want to revise your data structure and aim toward something *more compact*. Ideally you would want to revise the access patterns too, but that is very dependent on the algorithm. In your particular case, I recommend using a specific allocator for the inner vectors to pool those allocations into a dedicated arena, and if possible use smaller types than `int` to maximize the numbers of "node lines" per cache line. – Matthieu M. Jul 22 '13 at 17:39
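The arena idea from the last comment can be sketched roughly as follows (this is my illustration of the suggestion, not Matthieu M.'s code; `ArenaAlloc` and all names are made up): every inner vector draws from one shared contiguous buffer, so consecutive rows end up next to each other in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal arena/pool allocator sketch: all inner vectors allocate out of
// one shared, contiguous char buffer via a bump pointer. Illustrative
// only - no alignment or growth handling.
template <class T>
struct ArenaAlloc {
    typedef T value_type;

    std::vector<char>* arena;   // shared backing buffer
    std::size_t* offset;        // bump pointer into it

    ArenaAlloc(std::vector<char>* a, std::size_t* off) : arena(a), offset(off) {}
    template <class U>
    ArenaAlloc(const ArenaAlloc<U>& other) : arena(other.arena), offset(other.offset) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = n * sizeof(T);
        assert(*offset + bytes <= arena->size());   // sketch: arena must be big enough
        T* p = reinterpret_cast<T*>(arena->data() + *offset);
        *offset += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) {}             // the arena is freed all at once
};

template <class T, class U>
bool operator==(const ArenaAlloc<T>& a, const ArenaAlloc<U>& b) { return a.arena == b.arena; }
template <class T, class U>
bool operator!=(const ArenaAlloc<T>& a, const ArenaAlloc<U>& b) { return !(a == b); }
```

Rows would then be declared as `std::vector<int, ArenaAlloc<int>>` sharing one buffer; a real version needs alignment and growth handling, which is why profiling, as the comment says, is essential.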

4 Answers

7

FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:

#include <smmintrin.h>

int getDiff_SSE(int indx1, int indx2)
{
    int result[4] __attribute__ ((aligned(16))) = { 0 };

    const int * const p1 = &nodeL[indx1][0];
    const int * const p2 = &nodeL[indx2][0];

    const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);

    __m128i vresult = _mm_set1_epi32(0);

    for (int k = 0; k < 128; k += 4)
    {
        __m128i v1, v2, vmax;

        v1 = _mm_loadu_si128((__m128i *)&p1[k]);
        v2 = _mm_loadu_si128((__m128i *)&p2[k]);
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *)result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
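The XOR/subtract pair is the clever part, and a scalar model may help (this is my explanation of the trick, not part of the answer): for a mask value `m` that is either 0 or -1, `(x ^ m) - m` leaves `x` unchanged when `m == 0` and negates it when `m == -1`, because `(x ^ -1) - (-1)` is `~x + 1`, i.e. two's-complement negation.

```cpp
// Scalar model of the per-lane XOR/subtract trick used with vke/vko above:
// m must be 0 or -1. With m == -1 this computes -x; with m == 0 it is x.
int negateIf(int x, int m) {
    return (x ^ m) - m;
}
```

Applied per lane, this negates `p1` in the even lanes and `p2` in the odd lanes, so the single `_mm_add_epi32` produces `pplus` in the even lanes and `pminus` in the odd lanes, and the vector max replaces the branch.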
Paul R
  • 1
    I like this answer. It has the XOR step I was trying to figure out. – Delta_Fore Jul 23 '13 at 08:36
  • Thanks - of course it would probably be a lot more impressive if it weren't for the previously discussed memory bandwidth bottleneck. – Paul R Jul 23 '13 at 09:09
  • 2
    I was thinking this could be a candidate for CUDA processing, where the initial hit of copying the data onto the board would be offset by the gain in processing. And then use the CPU for something else. But without reference to the complete intent it's hard to say – Delta_Fore Jul 23 '13 at 09:12
  • 1
    One problem with doing this with CUDA is that it's a reduction, and reductions tend to be rather messy and inefficient on GPUs. But it would be interesting to try it, particularly given the abundant memory bandwidth on modern GPUs. – Paul R Jul 23 '13 at 09:27
  • 1
    Your method gives a solid 10% improvement in my AMD FX 8350. So, you are the MAN!! Thanks. – Alexandros Jul 23 '13 at 14:27
3

You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.

The declaration of nodeL can have an effect on the performance so it's important to choose an efficient container for your data.

There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads - i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.

It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.

Skizz
3

This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.

int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;

    const vector<int>& nodetemp1 = nodeL[indx1];
    const vector<int>& nodetemp2 = nodeL[indx2];

    for (int k = 0; k < 128; k += 2) {
        pplus = nodetemp2[k] - nodetemp1[k];
        pminus = nodetemp1[k + 1] - nodetemp2[k + 1];

        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
Neil Kirk
  • 1
    I benchmarked this, compared to the original version. Numbers aren't 100% stable (because my machine is doing other things at the same time), but it gives at best 10% improvement on `g++ -O3`. – Mats Petersson Jul 22 '13 at 16:23
  • Unfortunately Neil it is not. For 100000 iterations which is faster (original or your method) is random – Alexandros Jul 22 '13 at 16:24
  • 10% for two lines of code isn't bad. Alexandros, how did you test? – Neil Kirk Jul 22 '13 at 16:26
  • And did you test with 8GB+ of data as is the original problem? You can't compare 8MB test with 8GB. – Neil Kirk Jul 22 '13 at 16:29
  • 100000 random nodeIds and goalNodeIds stored in a vector. After those vector are created, I measured performance for 100000 getDiff for those vectors and then 100000 getDiff2. Usually, the one which is put second is faster, due to the random nodes vectors are already in the cache. – Alexandros Jul 22 '13 at 16:29
3

A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.

I've tried to rewrite it using SSE (AVX) instructions using the library linked here.

The original code on my system ran in 11.5s. With Neil Kirk's optimisation, it went down to 10.5s.

EDIT: Tested the code with a debugger rather than in my head!

int getDiff(std::vector<std::vector<int>>& nodeL, int row1, int row2) {
    Vec4i result(0);
    const std::vector<int>& nodetemp1 = nodeL[row1];
    const std::vector<int>& nodetemp2 = nodeL[row2];

    Vec8i mask(-1, 0, -1, 0, -1, 0, -1, 0);
    for (int k = 0; k < 128; k += 8) {
        Vec8i nodeA(nodetemp1[k], nodetemp1[k+1], nodetemp1[k+2], nodetemp1[k+3],
                    nodetemp1[k+4], nodetemp1[k+5], nodetemp1[k+6], nodetemp1[k+7]);
        Vec8i nodeB(nodetemp2[k], nodetemp2[k+1], nodetemp2[k+2], nodetemp2[k+3],
                    nodetemp2[k+4], nodetemp2[k+5], nodetemp2[k+6], nodetemp2[k+7]);
        Vec8i tmp = select(mask, nodeB - nodeA, nodeA - nodeB);
        Vec4i tmp_a(tmp[0], tmp[2], tmp[4], tmp[6]);
        Vec4i tmp_b(tmp[1], tmp[3], tmp[5], tmp[7]);
        Vec4i max_tmp = max(tmp_a, tmp_b);
        result = select(max_tmp > result, max_tmp, result);
    }
    return horizontal_add(result);
}

The lack of branching speeds it up to 9.5s but still data is the biggest impact.

If you want to speed it up more, try changing the data structure to a single array/vector rather than a 2D one (i.e. a std::vector of std::vectors), as that will reduce cache pressure.

EDIT: I thought of something - you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory, which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure.

EDIT 2: Tested the code with a debugger rather than in my head! Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.
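The single-array suggestion might look like this (a sketch of mine, not code from the answer; `getDiffFlat` and `ROW_LEN` are made-up names): all rows live in one contiguous vector and a row is found by index arithmetic, so each call does one pointer chase instead of two.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Flat single-vector layout: row i occupies elements [i*ROW_LEN, (i+1)*ROW_LEN).
static const int ROW_LEN = 128;

int getDiffFlat(const std::vector<int>& flat, int indx1, int indx2) {
    const int* a = &flat[(std::size_t)indx1 * ROW_LEN];
    const int* b = &flat[(std::size_t)indx2 * ROW_LEN];
    int result = 0;
    for (int k = 0; k < ROW_LEN; k += 2) {
        int pplus  = b[k] - a[k];            // same arithmetic as the original getDiff
        int pminus = a[k + 1] - b[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}
```

The layout change also removes the per-row heap allocations, so the 18M x 128 data becomes one 9 GB block that the prefetcher can at least work with inside a row.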

Delta_Fore
  • 1
    It is obvious that my compiler did a lot of the optimizations already. But with your code I get a 3.5% better performance. I guess that is as good as it can get. Thanks man. – Alexandros Jul 22 '13 at 20:03
  • I run some test to see if I get the same results but I do not. Are you 100% sure about the results? This max(pplus,pminus) does not seem correct, since pplus is valid only for 0,2,4,6 and pminus is only valid for 1,3,5,7. And you do not seem to use the mask as well – Alexandros Jul 22 '13 at 20:23
  • Why the `horizontal_add` at the end ? – Paul R Jul 23 '13 at 05:50
  • @Ronnie. Thanks a lot for your effort. I have upvoted your answer but I had to choose Paul R answer, since it does not rely in external libraries. Hope you do not mind – Alexandros Jul 23 '13 at 14:29