
I was thinking about solving this, but it's looking to be quite a task. If I take this on by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that already solves this, or if anyone has thoughts/advice.

void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size_t size)
{
    // Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer 
    // needs to be up to one byte longer than it would need to be in memcpy.
    // Maybe explicitly providing the end of the buffer is best.
    // Also note that pSrc has NO alignment assumptions at all.
}

My application is time-critical, so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for a constant 'size' input, between 4 and 12, for randomly unaligned src buffers.

  • Memory should be moved in word sized blocks whenever possible
  • Alignment of these word sized blocks is important. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.

Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?

Edit: It seems people are voting to close this as "too broad". A few narrowing details: AMD64 is the preferred architecture, so let's assume that; this means little endian, etc. The implementation should fit well within the size of an answer, so I don't think this is too broad. I'm asking for answers that give a single implementation at a time, even though there are a few possible approaches.

VoidStar
  • Write a clean and simple version first (should be quite trivial), then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further work. – Paul R Aug 17 '15 at 06:33
  • Yeah, that's what I'm doing now. I can't believe this isn't already solved though. This is for a hash table in a chess engine; the overall code I'm replacing takes 60% of the CPU in just a handful of lines of code. – VoidStar Aug 17 '15 at 06:36
  • I've seen similar questions here on SO before but I don't have time to look for duplicates just now - you might want to try some further searches. – Paul R Aug 17 '15 at 06:37
  • I have never seen unaligned accesses in chess program hash tables before. They are usually *very* carefully cache line aligned. – Bo Persson Aug 17 '15 at 06:44
  • See also: http://stackoverflow.com/questions/18803638/bitwise-memmove – Paul R Aug 17 '15 at 06:50
  • Thanks, that question is relevant; one answer is on the right track but isn't exactly optimized. I guess it does seem easier to copy, then shift, but I'm not sure if that's really the best way. I was hoping there was more interest than this in this type of question, but I guess I'll have to take what I can get. – VoidStar Aug 17 '15 at 06:58
  • Well give it some time - you only posted the question 40 minutes ago - most of the US is asleep currently, and Europeans are on their way into work - you might see some responses later today... – Paul R Aug 17 '15 at 07:04
  • @Bo Persson: This is experimental. My transposition table cache scheme is so effective that it is worth making it huge, even though all the time is spent in that code, it saves cycles overall. In particular, I am re-writing the trampoline to be a "packed bit array". The hash table size is in the gigabytes already, and squeezing more out of the memory might yield a speedup, despite the extra cycles packing/unpacking. – VoidStar Aug 17 '15 at 08:00
  • Please choose one of C and C++. – fuz Aug 17 '15 at 09:30
  • Some more related questions which might also be useful: http://stackoverflow.com/questions/5704597/fastest-way-to-write-a-bitstream-on-modern-x86-hardware and http://stackoverflow.com/questions/6346637/append-1-to-32-bit-numbers-to-a-char-buffer/6348054#6348054 – Paul R Aug 17 '15 at 14:11
  • Here is a possibly faster implementation: https://stackoverflow.com/questions/17320643/is-it-possible-to-do-memcpy-in-bits-instead-of-bytes/71347247#71347247 – Andry Mar 04 '22 at 06:32

2 Answers


I would start with a simple implementation such as this:

#include <limits.h>   // CHAR_BIT
#include <stddef.h>   // size_t
#include <stdint.h>   // uint8_t

inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0)
    {
        for (size_t i = 0; i < size; ++i)
        {
            pDest[i] = pSrc[i];
        }
    }
    else if (size > 0)
    {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i)
        {
            uint8_t v1 = pSrc[i + 1];   // note: reads one byte past pSrc + size, as the question allows
            pDest[i] = (v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset));
            v0 = v1;            
        }
    }
}

(warning: untested code!).

Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.

Be aware that for small copies the overhead of testing for alignment, word-sized copies, etc. may well outweigh any benefits, so a simple byte-by-byte loop such as the above may well be close to optimal.

Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.

Paul R
  • Agree with the solution, and I would note that since the bytes are contiguous, further speedup may be obtained by working with chunks of 32 bit or 64 bit integers (except possibly the last chunk). (Verify that big or little endianness doesn't matter.) – A.S.H Aug 17 '15 at 07:37
  • @A.S.H: note that the OP's use case is small copies of 4 - 12 bytes, so the cost of additional logic and the resulting branch mispredictions etc could more than outweigh any benefits from trying to load/store quantities larger than one byte. – Paul R Aug 17 '15 at 07:43
  • if the size is constant, your algo can be customized without much branch logic, if any. (The OP says "best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.") – A.S.H Aug 17 '15 at 07:49
  • @A.S.H: well there is one way to find out of course - implement and benchmark - results will be highly CPU-dependent and data-dependent though. If `size` is known at compile time then it might be worth making it a template parameter and do special case implementations as needed. – Paul R Aug 17 '15 at 08:18
  • @PaulR: Actually you probably wouldn't have to make `size` a template parameter if the function is successfully inlined. I tend to expect this from my compiler, but I suppose if desperate I might try forcing it by making it explicitly templated. The drawback is if you ever want it dynamic, you have to copy the function. – VoidStar Aug 17 '15 at 08:25
  • Yes, true - inlining should take care of eliminating dead code when `size` is known at compile time so you could keep it as a function parameter. – Paul R Aug 17 '15 at 08:28
  • @PaulR: branching on `size` is fine, because the OP suggests every call will have the same `size`. So the branch will predict well. I think there's a possibility for a *big* speedup when `size <= 7`, because you can use a single 64bit shift. (Hmm, might need an AVX masked write if we need to write fewer than 8 bytes. Or mask the old contents of the dest block so you can OR it into the shifted src.) If we're allowed to read and rewrite beyond the ends of the src and dest, that could make this possible. (i.e. make sure they're not in the last page of an allocated block.) – Peter Cordes Aug 17 '15 at 12:36
  • @PeterCordes: maybe - I guess I was thinking more of the combined logic for testing for size *and* dealing with alignment of source/dest (since there is some interaction between the two when looking at small copies). I'm not sure whether we can assume x86 either. But anyway, I think it would be a good idea to profile this first - since the data set is apparently very large it's possible that memory access will be the bottleneck and trying to optimise this further may not be very rewarding. – Paul R Aug 17 '15 at 13:32
  • @PaulR: OP said in the last paragraph that amd64 is the primary target. If unaligned 32 or 64bit loads/stores need to be avoided (slow or fault), then I agree with your conclusion that the logic might outweigh just byte-at-a-time, even though we have to shift and OR every byte separately. I think even an AVX version would be of interest to the OP. (I was thinking about `VPMASKMOV`, but that has dword minimum granularity, which doesn't help.) – Peter Cordes Aug 17 '15 at 13:49
  • Aha - I missed the edit re AMD64. (It's a pity it's not POWER/PowerPC, as VMX/AltiVec has 128 bit shifts at single bit granularity!). – Paul R Aug 17 '15 at 14:00
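
As a rough illustration of the single-shift idea in Peter Cordes' comment above, here is a minimal hypothetical sketch (not from any answer): it treats the source as a little-endian bit stream (offset counted from the LSB of pSrc[0], the convention the SSE answer below uses; the MSB-first convention in the answer above would additionally need a byte swap) and assumes that reading and writing a few bytes past the end of either buffer is allowed:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical sketch: for size <= 7, one unaligned 64-bit load, one shift
// and one unaligned 64-bit store cover the whole copy. Assumes little-endian
// bit numbering and buffers that tolerate reads/writes up to 7 bytes past
// their useful end.
static void OffsetMemCpySmall(uint8_t* pDest, const uint8_t* pSrc,
                              uint8_t srcBitOffset, size_t size /* 1..7 */)
{
    uint64_t v;
    memcpy(&v, pSrc, 8);    // unaligned load; may read past pSrc + size
    v >>= srcBitOffset;     // the low size*8 bits now hold the result
    memcpy(pDest, &v, 8);   // overwrites up to 8 - size bytes of padding
    (void)size;             // size only says how many stored bytes are meaningful
}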

I think the trivial byte-by-byte solution (see @PaulR's answer) is the best approach for small blocks, unless you can satisfy the following additional constraints:

  1. Input buffer is allocated with some padding, i.e. accessing some bytes after the last one does not crash.
  2. Output buffer is also allocated with some padding, and it does not matter if a few bytes after the desired result location are overwritten. If it does matter, then you'll need to do more work to preserve those after-the-end bytes (see the allocation sketch just after this list).
  3. Input and output ranges involved do not overlap (including a few more padding bytes after the end), just like in memcpy.
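
For example, constraints 1 and 2 can be met simply by over-allocating both buffers; a hypothetical helper (PAD and AllocPadded are made-up names, not from the linked code) might look like this:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: allocate with extra slack so that the wide loads and
// stores below may safely run a few bytes past the last meaningful byte.
// PAD = 16 covers a full SSE register of over-read/over-write.
enum { PAD = 16 };

static uint8_t* AllocPadded(size_t bytes)
{
    return (uint8_t*)malloc(bytes + PAD);
}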

If you can, then it is possible to increase the granularity of the algorithm. It is very easy to change @PaulR's answer to use uint64_t words instead of uint8_t bytes everywhere, and as a result it would work faster.
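
A minimal hypothetical sketch of such a word-at-a-time variant (the "Paul R x64" entry in the benchmark below is the same idea, though not necessarily this exact code), assuming the little-endian bit numbering used by the SSE version below and padded buffers as per the constraints above:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical word-at-a-time sketch: produce 8 output bytes per iteration.
// Assumes little-endian bit numbering and that reading/writing up to 8 bytes
// past the end of either buffer is harmless (see the constraints above).
static void OffsetMemCpy_u64(uint8_t* pDest, const uint8_t* pSrc,
                             uint8_t srcBitOffset, size_t size)
{
    if (srcBitOffset == 0)
    {
        memcpy(pDest, pSrc, size);
        return;
    }
    const unsigned inv = 64u - srcBitOffset;   // srcBitOffset is 1..7 here
    for (size_t i = 0; i < size; i += 8)
    {
        uint64_t lo;
        memcpy(&lo, pSrc + i, 8);              // unaligned 64-bit load (fine on AMD64)
        uint64_t hi = pSrc[i + 8];             // one extra byte supplies the top bits
        uint64_t out = (lo >> srcBitOffset) | (hi << inv);
        memcpy(pDest + i, &out, 8);            // the last store may spill into the padding
    }
}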

We can use SSE to increase the word size further. Since SSE has no way to shift a whole 128-bit register by a bit count, we do two 64-bit shifts and then glue the results together. The gluing is done with _mm_shuffle_epi8 from SSSE3, which allows the bytes of an XMM register to be shuffled arbitrarily. For the shifting we use _mm_srl_epi64, because that is the only way to shift 64-bit integers by a non-immediate number of bits. I have added the restrict keyword from C (as a macro) to the pointer arguments, because if they alias, the algorithm will not work anyway.

Here is the code:

#include <stddef.h>      // size_t
#include <stdint.h>      // uint8_t
#include <tmmintrin.h>   // SSSE3 intrinsics (_mm_shuffle_epi8)

#define RESTRICT __restrict   // accepted by MSVC, GCC and Clang

void OffsetMemCpy_stgatilov(uint8_t *RESTRICT pDest, const uint8_t *RESTRICT pSrc, const uint8_t srcBitOffset, const size_t size) {
    __m128i bits = (sizeof(size_t) == 8 ? _mm_cvtsi64_si128(srcBitOffset) : _mm_cvtsi32_si128(srcBitOffset));
    const uint8_t *pEnd = pSrc + size;
    while (pSrc < pEnd) {
        __m128i input = _mm_loadu_si128((__m128i*)pSrc);   // unaligned 16-byte load
        // byte 7 is duplicated: both 64-bit halves need it
        __m128i reg = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14));
        __m128i shifted = _mm_srl_epi64(reg, bits);         // shift both halves right by srcBitOffset
        // glue 7 finished bytes from each half into 14 contiguous result bytes
        __m128i comp = _mm_shuffle_epi8(shifted, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, -1, -1));
        _mm_storeu_si128((__m128i*)pDest, comp);            // unaligned store; the last 2 bytes land in padding or are overwritten next iteration
        pSrc += 14;  pDest += 14;
    }
}

It processes 14 bytes per iteration. Each iteration is rather simple, and there is also a little code before the loop. Here is the assembly code of the whole function body as generated by MSVC 2013 x64:

    movzx   eax, r8b
    movd    xmm3, rax
    lea rax, QWORD PTR [rdx+r9]
    cmp rdx, rax
    jae SHORT $LN1@OffsetMemC
    movdqa  xmm1, XMMWORD PTR __xmm@0e0d0c0b0a0908070706050403020100
    movdqa  xmm2, XMMWORD PTR __xmm@ffff0e0d0c0b0a090806050403020100
    sub rcx, rdx
    npad    11
$LL2@OffsetMemC:
    movdqu  xmm0, XMMWORD PTR [rdx]
    add rdx, 14
    pshufb  xmm0, xmm1
    psrlq   xmm0, xmm3
    pshufb  xmm0, xmm2
    movdqu  XMMWORD PTR [rcx+rdx-14], xmm0
    cmp rdx, rax
    jb  SHORT $LL2@OffsetMemC
$LN1@OffsetMemC:
    ret 0

IACA says the whole function takes 4.5 cycles of throughput and 13 cycles of latency on Ivy Bridge, assuming the loop executes exactly once and there are no cache, branch, or decoding issues. In the benchmark, however, one such call takes 7.5 cycles on average.

Here are brief results of the throughput benchmark on a 3.4 GHz Ivy Bridge (see more results in the code):

Throughput in billions of calls per second:

  size    Paul R    Paul R x64    stgatilov
     4    0.132     0.248         0.45
     8    0.0782    0.249         0.45
    12    0.0559    0.191         0.453

Note, however, that real-world performance can differ drastically from these benchmark results.

The full code with benchmarking and more verbose results is here.

stgatilov