
I have two arrays and I want to copy one array into the other with some stride. For example, I have

A A A A A A A A ...

B B B B B B B B ...

and I want to copy every three elements of B to A to obtain

B A A B A A B A ...

From the post "Is there a standard, strided version of memcpy?", it seems that there is no such possibility in C.

However, I have found that, in some cases, memcpy is faster than a for-loop-based copy.

My question is: is there any way to efficiently perform a strided memory copy in C++ that performs at least as well as a standard for loop?

Thank you very much.

EDIT - CLARIFICATION OF THE PROBLEM

To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs only the following for loop

for (int i=0; i<NumElements; i++)
    a[i] = b[i];

where both the []'s are overloaded operators (I'm using an expression templates technique), so that they can actually mean, for example,

 a[3*i]=b[i];
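To make the intent concrete, the expanded loop behaves like the following plain C++ sketch (the function name, element type, and sizes are illustrative, not part of my actual expression-template code):

```cpp
#include <cstddef>
#include <vector>

// Scatter every element of b into a with the given stride:
// b[0] -> a[0], b[1] -> a[stride], b[2] -> a[2*stride], ...
// a must have at least stride * (b.size() - 1) + 1 elements.
void strided_copy(std::vector<int>& a, const std::vector<int>& b,
                  std::size_t stride)
{
    for (std::size_t i = 0; i < b.size(); ++i)
        a[stride * i] = b[i];
}
```

With stride 3 this produces exactly the B A A B A A pattern shown above, where the untouched slots keep their original "A" values.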
Vitality
  • A standard for loop performs at least as fast as a standard for loop... Sarcasm aside, it depends on the data storage structure you're using. For arrays, I don't think you can do any better than a for loop, incremented by your modulus. – MobA11y Jun 13 '13 at 15:26
  • `memcpy` is sometimes faster than a `for` loop due to optimizations it can perform because the memory it's operating on is contiguous. Those optimizations can't be made here. – Collin Dauphinee Jun 13 '13 at 19:29
  • @dauphic But then why CUDA has `cudaMemcpy2D` which copies with pitch? – Vitality Jun 13 '13 at 20:34
  • @JackOLantern: CUDA operates in parallel. – Collin Dauphinee Jun 13 '13 at 21:05
  • @JackOLantern Because `cudaMemcpy2D` executes in parallel on a GPU installed on the device and `memcpy` executes on the device itself. – Captain Obvlious Jun 13 '13 at 21:06
  • @dauphic @CaptainObvlious In parallel programming on GPUs, the term "device" usually stands for GPU and "host" for CPU :-) But anyway, what I'm trying to say is that memory transactions in CUDA are optimized for a coalesced access pattern and surely not for strided access and, nevertheless, a strided access is available by `cudaMemcpy2D`. – Vitality Jun 13 '13 at 21:09

2 Answers


This might be too specific an answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make a strided copy even faster. This can be life-saving in an environment where resources are relatively limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android, where most devices still use the ARM v7a architecture, which supports NEON.

The following examples demonstrate this with a loop that copies the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image. The source and destination buffers are both 640*480/2 bytes. All of the examples were compiled with g++ 4.8 from Android NDK r9d and executed on a Samsung Exynos Octa 5420 processor:

Level 1: Regular

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 only, takes about 1.5 ms on average.

Level 2: Unrolled and squeezed a bit more with moving pointers

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    unsigned char* endptr = dstptr + stride;
    while(dstptr<endptr){
        *(dstptr + 0)             = *(srcptr + 0);
        *(dstptr + stride + 0)    = *(srcptr + 1);
        *(dstptr + 1)             = *(srcptr + 2);
        *(dstptr + stride + 1)    = *(srcptr + 3);
        *(dstptr + 2)             = *(srcptr + 4);
        *(dstptr + stride + 2)    = *(srcptr + 5);
        *(dstptr + 3)             = *(srcptr + 6);
        *(dstptr + stride + 3)    = *(srcptr + 7);
        *(dstptr + 4)             = *(srcptr + 8);
        *(dstptr + stride + 4)    = *(srcptr + 9);
        *(dstptr + 5)             = *(srcptr + 10);
        *(dstptr + stride + 5)    = *(srcptr + 11);
        *(dstptr + 6)             = *(srcptr + 12);
        *(dstptr + stride + 6)    = *(srcptr + 13);
        *(dstptr + 7)             = *(srcptr + 14);
        *(dstptr + stride + 7)    = *(srcptr + 15);
        srcptr+=16;
        dstptr+=8;
    } 
}

Compiled with -O3 only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.

Level 3: Regular + GCC automatic NEON vectorization

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp, takes about 0.6 ms on average. For reference, a memcpy of 640*480 bytes, or double the amount of what's tested here, takes about 0.6 ms on average.

As a side note, the second code (unrolled and pointered) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.

Ayberk Özgür

Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for loop?

Edit 2: There is no standard C++ library function for strided copying.

Since strided copying is not as common as contiguous memory copying, neither chip manufacturers nor language designers have added specialized support for it.
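As a side note, the closest thing the standard library offers is `std::valarray`, which can express a strided assignment through `std::slice`; implementations typically compile it to an ordinary loop rather than anything faster, so treat the following only as a notational sketch (the function name is mine):

```cpp
#include <cstddef>
#include <valarray>

// Scatter the elements of b into a at the given stride using std::slice:
// writes a[0], a[stride], a[2*stride], ...
// a must have at least stride * (b.size() - 1) + 1 elements.
void slice_scatter(std::valarray<int>& a, const std::valarray<int>& b,
                   std::size_t stride)
{
    a[std::slice(0, b.size(), stride)] = b;
}
```

For example, calling slice_scatter on a nine-element zero-filled valarray with b = {1, 2, 3} and stride 3 leaves a as {1, 0, 0, 2, 0, 0, 3, 0, 0}.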

Assuming a standard for loop, you may be able to gain some performance by using loop unrolling. Some compilers have options to unroll loops, but it's not a "standard" option.

Given a standard for loop:

#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24

unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];

unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

Loop unrolling would repeat the contents of the "standard" for loop:

for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 

   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

In the unrolled version, the number of loop iterations has been cut in half.

The performance improvement may be negligible compared to other options. The following issues affect performance and each may have different speed improvements:

  • Processing data cache misses
  • Reloading of instruction pipeline (depends on processor)
  • Operating System swapping memory with disk
  • Other tasks running concurrently
  • Parallel processing (depends on processor / platform)

One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.
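That last idea can be sketched with `std::thread`; the names are illustrative, and for arrays this small the thread-creation overhead would dwarf the copy itself, so this only pays off for large buffers:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Build the B A A B A A ... pattern with two threads: one scatters B
// into every third slot, the other fills the remaining slots from A.
// The threads write disjoint indices, so no synchronization is needed.
// result.size() must equal 3 * b.size(), and a.size() must be 2 * b.size().
void parallel_interleave(std::vector<int>& result,
                         const std::vector<int>& a,
                         const std::vector<int>& b)
{
    std::thread b_copier([&] {
        for (std::size_t i = 0; i < b.size(); ++i)
            result[3 * i] = b[i];                      // slots 0, 3, 6, ...
    });
    std::thread a_copier([&] {
        for (std::size_t i = 0; i < a.size(); ++i)
            result[3 * (i / 2) + 1 + (i % 2)] = a[i];  // the remaining slots
    });
    b_copier.join();
    a_copier.join();
}
```

Compile with -pthread (or the platform's equivalent). Whether two threads actually beat one depends on memory bandwidth: a single core can often saturate the bus on its own, in which case the second thread buys nothing.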

Thomas Matthews
  • Thanks for your kind answer. I have edited my post to better explain the problem. Do you think that by a `#pragma unroll` I have a chance to improve the situation? I don't know, since everything about the copy is known at run-time. – Vitality Jun 13 '13 at 21:14
  • As I said, it depends on the processor. With some processors, a branch flushes the instruction pipeline and the processor has to reload it. Some modern processors have enough instruction cache that they can hold a simple for loop and don't have to reload it. – Thomas Matthews Jun 13 '13 at 21:20
  • Most processors prefer executing sequential instructions and would prefer not to encounter branch instructions. – Thomas Matthews Jun 13 '13 at 21:22
  • My preference is not to use a compiler `pragma` but to unroll the loops. This allows more portable code and I can control the multiplicity of content. In my work, I have unrolled a loop that processes a 32 slot FIFO with 32 separate statements; works really fast. – Thomas Matthews Jun 13 '13 at 21:24
  • I understand your point that `pragma` can introduce code portability issues, but in general I do not know much about the mentioned `for` loop, so that I cannot perform the loop unroll "by hand". Could you please edit your post providing a short, definitive statement that there is no "built-in" C++ function to perform strided copies, so that I could accept your answer? Thanks. – Vitality Jun 14 '13 at 12:43