
My own implementation bit me when I tried to optimize the following with SSE4:

std::distance(byteptr, std::mismatch(byteptr, ptr + lenght, dataptr).first)

This compares the bytes at byteptr and dataptr and returns the index of the first mismatch. I really do need the raw speed: I'm processing so much memory that RAM bandwidth is already a bottleneck. Fetching and comparing 16 bytes at a time with SSE4 should be faster than comparing byte by byte.
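For reference, the scalar baseline can be wrapped in a function so that any SIMD version can be checked against it. This is a minimal sketch: `mismatch_pos` is a hypothetical name, and it assumes the range ends at `byteptr + length`.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Scalar reference (hypothetical helper): returns the index of the first
// differing byte, or length if both ranges are equal.
std::size_t mismatch_pos(const char* byteptr, const char* dataptr, std::size_t length)
{
    // std::mismatch returns a pair of iterators; .first points into byteptr.
    const char* first_diff = std::mismatch(byteptr, byteptr + length, dataptr).first;
    return static_cast<std::size_t>(std::distance(byteptr, first_diff));
}
```

Running both versions on the same buffers and comparing the returned indices is a simple way to catch the kind of discrepancy described here.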

Here is my current code that I could not get working. It uses GCC SSE intrinsics and needs SSE4.2:

// define SIMD 128-bit type of bytes.
typedef char v128i __attribute__ ((vector_size(16)));
// mask of four low bits set.
const uintptr_t aligned_16_imask = (uintptr_t)15;
// mask of four low bits unset.
const uintptr_t aligned_16_mask = ~aligned_16_imask;

inline unsigned int cmp_16b_sse4(v128i *a, v128i *b) {
    return __builtin_ia32_pcmpistri128(__builtin_ia32_lddqu((char*)a), *b, 0x18);  
}

size_t memcmp_pos(const char * ptr1, const char * ptr2, size_t lenght)
{
    size_t nro = 0;
    size_t cmpsz;
    size_t alignlen = lenght & aligned_16_mask;
    // process 16-bytes at time.
    while(nro < alignlen) {
        cmpsz = cmp_16b_sse4((v128i*)ptr1, (v128i*)ptr2);
        ptr1 += cmpsz;
        ptr2 += cmpsz;
        nro += cmpsz;
        // if compare failed return now.
        if(cmpsz < 16)
            return nro;
        if(cmpsz != 16)
            break;
    }
    // process remainder 15 bytes:
    while( *ptr1 == *ptr2 && nro < lenght) {
        ++nro;
        ++ptr1;
        ++ptr2;
    }
    return nro;
}

When testing the above function it works most of the time but in some cases it fails.

JATothrim
  • What do you mean by 'it fails'? Crash, false positives/negatives...? – zx485 Sep 04 '17 at 00:26
  • The SSE code above produces different results than the std::mismatch-based one. – JATothrim Sep 04 '17 at 14:50
  • I found out what I did wrong: the code should have used `pcmpestri` instead, because `pcmpistri` treats a null byte as the end of the string. My input is unstructured binary data by nature, so this broke the code. – JATothrim Sep 06 '17 at 16:34
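The fix described in that last comment could look roughly like the sketch below. It is not the asker's final code: `memcmp_pos_estri` is a hypothetical name, it uses the `_mm_cmpestri` intrinsic from `<nmmintrin.h>` rather than the raw GCC builtin, and it assumes an x86 CPU with SSE4.2 and a GCC or Clang compiler (for the `target` attribute). The tail loop also checks the bound before dereferencing.

```cpp
#include <cstddef>
#include <nmmintrin.h> // SSE4.2 string-compare intrinsics

// Hypothetical explicit-length variant: zero bytes are ordinary data here.
__attribute__((target("sse4.2")))
std::size_t memcmp_pos_estri(const char* ptr1, const char* ptr2, std::size_t length)
{
    std::size_t pos = 0;
    // Compare full 16-byte blocks; never loads past ptr + length.
    while (length - pos >= 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr1 + pos));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr2 + pos));
        // Equal-each on unsigned bytes with negated result (0x18, as in the
        // question): yields the index of the first differing byte, or 16.
        int idx = _mm_cmpestri(a, 16, b, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                               _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)
            return pos + static_cast<std::size_t>(idx);
        pos += 16;
    }
    // Remaining 0..15 bytes: test the bound first, then the bytes.
    while (pos < length && ptr1[pos] == ptr2[pos])
        ++pos;
    return pos;
}
```

Because both lengths are passed explicitly as 16, buffers that contain zero bytes compare the same way as any other data.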

1 Answer


One known problem with pcmpistri is that it always reads the full 16 bytes, even beyond the end of the variable. This becomes a problem at a page boundary, on the border between allocated and unallocated memory. See here (scroll down to "Renat Saifutdinov").

This can be avoided by using only aligned reads of the source even if unaligned reads are supported, see this SO answer.

This could be one reason why your code fails.
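To illustrate the page-boundary point, here is a small sketch of the address arithmetic involved. The helper names are hypothetical, and the 4 KiB page size is an assumption (typical on x86, but not guaranteed everywhere).

```cpp
#include <cstddef>
#include <cstdint>

// Assumed page size; query the OS if it matters for your platform.
constexpr std::uintptr_t kPageSize = 4096;

// True if a 16-byte load starting at p stays within a single page.
inline bool load16_stays_in_page(const void* p)
{
    std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(p) & (kPageSize - 1);
    return offset <= kPageSize - 16;
}

// Rounds p down to a 16-byte boundary. A 16-byte load from an aligned
// address can never straddle two pages, because 16 divides the page size.
inline const char* align_down_16(const char* p)
{
    return reinterpret_cast<const char*>(
        reinterpret_cast<std::uintptr_t>(p) & ~static_cast<std::uintptr_t>(15));
}
```

Note that an aligned load which starts before the requested pointer also brings in bytes that precede the data, so those positions have to be masked out of the comparison result.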

zx485
  • I think the code doesn't suffer from either of these problems. For the 16-bytes-per-round loop I have `lenght & aligned_16_mask`, which rounds the number of bytes processed in the first loop down to a multiple of 16. – JATothrim Sep 04 '17 at 15:32