One popular answer on StackOverflow that does use x86-64 assembly and SSE can be found here: Very fast memcpy for image processing?. If you use this code, you'll need to make sure your buffers are 16-byte (128-bit) aligned. A basic explanation for that code (with a rough intrinsics sketch after the list) is that:
- Non-temporal stores are used, so unnecessary cache fills can be bypassed and writes to main memory can be combined.
- Reads and writes are interleaved only in very large chunks (many reads, then many writes). Performing many reads back-to-back typically performs better than an alternating read-write-read-write pattern.
- Much larger (128-bit SSE) registers are used.
- Prefetch instructions are included as hints to the CPU's pipeline.
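Putting those points together, here is a minimal sketch of such a copy loop using SSE2 intrinsics rather than hand-written assembly. This is not the StackOverflow answer's exact code; the 128-byte chunk size, the prefetch distance, and the name copy_nt are illustrative choices, and it assumes both buffers are 16-byte aligned and the length is a multiple of the chunk size:

```c
/* Simplified sketch of a non-temporal SSE copy loop (illustrative, not the
 * answer's exact assembly). Assumes: dst and src are 16-byte aligned and
 * bytes is a multiple of BLOCK. */
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

#define BLOCK 128        /* copy in 128-byte chunks: 8 x 16-byte registers */

static void copy_nt(void *dst, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i       *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / BLOCK; ++i) {
        /* Hint the prefetcher at data several chunks ahead (distance is illustrative). */
        _mm_prefetch((const char *)(s + 32), _MM_HINT_NTA);

        /* Batch the reads... */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        __m128i r4 = _mm_load_si128(s + 4);
        __m128i r5 = _mm_load_si128(s + 5);
        __m128i r6 = _mm_load_si128(s + 6);
        __m128i r7 = _mm_load_si128(s + 7);

        /* ...then batch the writes, as non-temporal (streaming) stores that
         * bypass the cache and get write-combined. */
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        _mm_stream_si128(d + 4, r4);
        _mm_stream_si128(d + 5, r5);
        _mm_stream_si128(d + 6, r6);
        _mm_stream_si128(d + 7, r7);

        s += 8;
        d += 8;
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

To get the required alignment you can allocate with something like _mm_malloc(size, 16) / _mm_free (or posix_memalign / _aligned_malloc), and fall back to plain memcpy for any unaligned head or tail.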
I found this document - Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540 - which seems to be the inspiration for the above code, albeit targeting older processor generations; it does, however, contain a significant amount of discussion on how the technique works.
For instance, consider this discussion on write-combining / non-temporal stores:
The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line.

The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.
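To make that distinction concrete, here is a minimal sketch (my own illustration, not code from the SGI document) contrasting a plain 8-byte store with a non-temporal one. It uses SSE2's movnti via _mm_stream_si64 in place of the MMX movntq the document mentions, so it requires a 64-bit target:

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si64 (movnti), _mm_sfence */

void store_regular(long long *dst, long long value)
{
    /* Plain 8-byte store: the cache line containing dst is fetched into the
     * cache first, then modified -- the read-modify-write the quote describes. */
    *dst = value;
}

void store_nontemporal(long long *dst, long long value)
{
    /* Non-temporal store: written out through a write-combining buffer,
     * bypassing L1/L2, without fetching the destination cache line. */
    _mm_stream_si64(dst, value);
    _mm_sfence();  /* order the streaming store before subsequent accesses */
}
```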
If you'd prefer to stick with memcpy, consider investigating the source code of the memcpy implementation you're using. Some memcpy implementations look for native-word-aligned buffers and improve performance by copying a full register's worth at a time; others will automatically copy as much as possible using native-word-aligned accesses and then mop up the remainder byte by byte. Making sure your buffers are 8-byte aligned will facilitate these mechanisms.
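As a rough illustration of that word-at-a-time-plus-mop-up pattern (a sketch only; a real memcpy also handles misaligned heads and is usually replaced by compiler built-ins), assuming both pointers are already word-aligned:

```c
/* Illustration of the "copy a native word at a time, then mop up the tail"
 * pattern; assumes dst and src are word-aligned and non-overlapping. */
#include <stddef.h>
#include <stdint.h>

static void copy_wordwise(void *dst, const void *src, size_t bytes)
{
    size_t words = bytes / sizeof(uintptr_t);
    size_t tail  = bytes % sizeof(uintptr_t);

    uintptr_t       *dw = (uintptr_t *)dst;
    const uintptr_t *sw = (const uintptr_t *)src;

    /* Bulk copy one native word (full register) at a time. */
    for (size_t i = 0; i < words; ++i)
        dw[i] = sw[i];

    /* Mop up the remaining bytes. */
    unsigned char       *db = (unsigned char *)(dw + words);
    const unsigned char *sb = (const unsigned char *)(sw + words);
    for (size_t i = 0; i < tail; ++i)
        db[i] = sb[i];
}
```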
Some memcpy implementations contain a ton of up-front conditionals to make them efficient for small buffers (< 512 bytes); you may want to consider copying the implementation and stripping out those branches, since you're presumably not working with small buffers.