Comparing two vector with SSE

Question

I have two vector<bool> A and B.

I want to compare them and count the number of elements that are equal:

For example:

A = {0,1,0,1}
B = {0,0,1,1}

Result will be equal to 2.

I can use _mm_cmpeq_epi8 but it is only compare 16 elements (i.e. I should convert 0 and 1 to char and then do the comparison). Is it possible to compare 128 elements each time with SSE (or SIMD instructions)?

Be aware that `vector` has special properties and isn't the same as an array of `bool` objects. I'm not sure if you have taken this into account because I haven't used those instructions before. — Neil Kirk, Dec 13 '15 at 23:38
This will be implementation-dependent, since the elements of a `vector` may be bytes, or packed bits, or anything else that the implementer sees fit to use. If you want to do this portably then use e.g. `vector`, and then an SSE implementation will be relatively straightforward. — Paul R, Dec 14 '15 at 00:13
You need to figure out how to get a pointer to the container that holds the actual integers in your implementation. I've done this myself but I never got around to making it generic enough to share here... — user541686, Dec 14 '15 at 08:13
Do you have to use `vector`? And do you want to store each bool as a byte or could you use bits? — Z boson, Dec 14 '15 at 08:39
If you know the size of your vectors at compile time, a bitset might be the solution at it would provide xor and counting the number of bits set. Reasonably good compilers should implement these with available SIMD instructions. — Come Raczy, Dec 14 '15 at 23:05
I can only vote for the suggestion to xor the bitmasks and then use 64-bit popcount instruction (see [this answer](http://stackoverflow.com/a/17355341/556899)). Note that you should store your data in custom bitsets, not in the awful `std::vector` trash. — stgatilov, Dec 20 '15 at 10:48

101010 · Answer 1 · 2015-12-14T23:48:15.190

std::vector<bool> is a specialization of std::vector for the type bool. Although not specified by the C++ standard, in most implementations std::vector<bool> is made space efficient such that each of its element is a single bit instead of a bool.

The behaviour of std::vector<bool> is similar to its primarily template counterpart, except that:

std::vector<bool> does not necessarily store its element contiguously .
In order to expose its elements (i.e., the individual bits) std::vector<bool> uses a proxy class (i.e., std::vector<bool>::reference). Objects of class std::vector<bool>::reference are returned by std::vector<bool> subscript operator (i.e., operator[]) by value.

Accordingly, I don't think it's portable to use _mm_cmpeq_epi8 like functions since storage of a std::vector<bool> is implementation defined (i.e., not guaranteed contiguous).

An alternative but portable way is to use regular STL facilities like the example below:

std::vector<bool> A = {0,1,0,1};
std::vector<bool> B = {0,0,1,1};
std::vector<bool> C(A.size());
std::transform(A.begin(), A.end(), B.begin(), C.begin(), [](bool const &a, bool const &b) { return a == b;});
std::cout << std::count(C.begin(), C.end(), true) << std::endl;

Live Demo

You can use `std::inner_product` to count equal elements without the temporary vector of your STL example. [Demo on ideone.com.](http://ideone.com/X50YDU) I'm not sure if it's cool or algorithm abuse. — Blastfurnace, Dec 14 '15 at 00:45
Why `a && a==b`? OP wants to count positions where they're equal, So OP wants to count 0==0 as well as 1==1. Hopefully a compiler will realize that it can do this with `!(a ^ b)`, and end up making code that implements the optimal asm implementation of `64-popcnt(A[i] XOR B[i])` (in 64b chunks). Or XOR/popcnt in vector registers if you can't assume hardware popcnt instruction support, and have to do it with vector shuffles or the bithack method in vector elements. — Peter Cordes, Dec 14 '15 at 23:41

Paul R · Accepted Answer · 2015-12-14T07:54:52.717

If you can either assume that vector<bool> is using contiguous byte-sized elements for storage, or if you can consider using something like vector<uint8_t> instead, then this example should give you a good starting point:

static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
    assert(vec1.size() == vec2.size());         // vectors must be same size

    const size_t n = vec1.size();
    const size_t max_block_size = 255 * 16;     // max block size before possible overflow

    __m128i vcount = _mm_setzero_si128();
    size_t i, count = 0;

    for (i = 0; i + 16 <= n; )                  // for each block
    {
        size_t m = std::min(n, i + max_block_size);

        for ( ; i + 16 <= m; i += 16)           // for each vector in block
        {
            __m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
            __m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
            __m128i vcmp = _mm_cmpeq_epi8(v1, v2);
            vcount = _mm_sub_epi8(vcount, vcmp);
        }
        vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
        count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
        vcount = _mm_setzero_si128();           // update count from current block
    }
    vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
    count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
    for ( ; i < n; ++i)                         // deal with any remaining partial vector
    {
        count += (vec1[i] == vec2[i]);
    }
    return count;
}

Note that this is using vector<uint8_t>. If you really have to use vector<bool> and can guarantee that the elements will always be contiguous and byte-sized then you'll just need to coerce the vector<bool> into a const uint8_t * or similar somehow.

Test harness:

#include <cassert>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

#include <emmintrin.h>    // SSE2

using std::vector;

static size_t count_equal_ref(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
    assert(vec1.size() == vec2.size());

    const size_t n = vec1.size();
    size_t i, count = 0;

    for (i = 0 ; i < n; ++i)
    {
        count += (vec1[i] == vec2[i]);
    }
    return count;
}

static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
    assert(vec1.size() == vec2.size());         // vectors must be same size

    const size_t n = vec1.size();
    const size_t max_block_size = 255 * 16;     // max block size before possible overflow

    __m128i vcount = _mm_setzero_si128();
    size_t i, count = 0;

    for (i = 0; i + 16 <= n; )                  // for each block
    {
        size_t m = std::min(n, i + max_block_size);

        for ( ; i + 16 <= m; i += 16)           // for each vector in block
        {
            __m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
            __m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
            __m128i vcmp = _mm_cmpeq_epi8(v1, v2);
            vcount = _mm_sub_epi8(vcount, vcmp);
        }
        vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
        count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
        vcount = _mm_setzero_si128();           // update count from current block
    }
    vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
    count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
    for ( ; i < n; ++i)                         // deal with any remaining partial vector
    {
        count += (vec1[i] == vec2[i]);
    }
    return count;
}

int main(int argc, char * argv[])
{
    size_t n = 100;

    if (argc > 1)
    {
        n = atoi(argv[1]);
    }

    vector<uint8_t> vec1(n);
    vector<uint8_t> vec2(n);

    srand((unsigned int)time(NULL));

    for (size_t i = 0; i < n; ++i)
    {
        vec1[i] = rand() & 1;
        vec2[i] = rand() & 1;
    }

    size_t n_ref = count_equal_ref(vec1, vec2);
    size_t n_test = count_equal(vec1, vec2);

    if (n_ref == n_test)
    {
        std::cout << "PASS" << std::endl;
    }
    else
    {
        std::cout << "FAIL: n_ref = " << n_ref << ", n_test = " << n_test << std::endl;
    }

    return 0;
}

Compile and run:

$ g++ -Wall -msse3 -O3 test.cpp && ./a.out
PASS

It is failed with `n = 1000000`. What is this ` assert(vec1.size() <= 255 * 16);` for? — user1436187, Dec 14 '15 at 00:53
We are using 8 bit vector elements to count the matches, so overflow is possible when there are more than 255*16 elements in the vector. In the question you specified a vector size of 128 so I didn't bother handling large vector sizes, however I have now updated the solution above to support vectors of any length. — Paul R, Dec 14 '15 at 07:56
Using a vector accumulator every 255 would be more efficient than two `_mm_extract`. `_mm_sad_epu8` zeros the upper bits, so you could accumulate with `_mm_add_epi32` (or `_mm_add_epi64` if you don't care about it being slower on Intel pre-Nehalem/Atom/Silvermont). Then you only need to extract once at the end. — Peter Cordes, Dec 14 '15 at 22:49
@PeterCordes: yes, good point - the original implementation didn't have the block loop and I just added it hastily when the OP complained about it not working with longer vectors. Having said that, the inner loop is typically going to iterate 255 times, so the cost of the horizontal sum is relatively insignificant, expect for very short vectors. — Paul R, Dec 14 '15 at 23:38

Comparing two vector with SSE

2 Answers2