
I am trying to copy from an array of arrays to another one, while leaving a space between arrays in the target.

Both buffers are contiguous. Each vector's size is between 5000 and 52000 floats, output_jump is the vector size times eight, and vector_count varies in my tests.

I did the best I could with what I learned here https://stackoverflow.com/a/34450588/1238848 and here https://stackoverflow.com/a/16658555/1238848

but it still seems slow.

void copyToTarget(const float *input, float *output, int vector_count, int vector_size, int output_jump)
{
    int left_to_do,offset;
    constexpr int block=2048;
    constexpr int blockInBytes = block*sizeof(float);
    float temp[2048];

    for (int i = 0; i < vector_count; ++i)
    {
        left_to_do = vector_size;
        offset = 0;
        while(left_to_do > block)
        {
            memcpy(temp, input, blockInBytes);
            memcpy(output, temp, blockInBytes);
            left_to_do -= block;
            input += block;
            output += block;
        }

        if (left_to_do)
        {
            memcpy(temp, input, left_to_do*sizeof(float));
            memcpy(output, temp, left_to_do*sizeof(float));
            input += left_to_do;
            output += left_to_do;
        }

        output += output_jump;
    }
}
tadman
Amir Ofir
    Why do you think copying into a `temp` variable before copying the whole thing again to its destination is faster than copying from the source directly to the destination? And what led you to conclude that making many small copies, one small chunk at a time, is faster than just a single `memcpy`, since everything is contiguous? – Sam Varshavchik Nov 09 '20 at 23:35
  • 1
    This would be a lot easier if `std::vector` was involved. These have a built-in length so no additional parameter is required. You can also combine `std::vector` trivially with `a + b`. – tadman Nov 09 '20 at 23:38
  • Copying to `temp` is pointless and wasteful. Get rid of that and you'll double your speed. – tadman Nov 09 '20 at 23:39
  • 1
    Why do you manually break your copying into 2,048 floats chunks? I'm sure `memcpy` can handle the required size, all at once. – Vlad Feinstein Nov 09 '20 at 23:56
  • `but still it seems so slow.` Was your attempt faster or slower than plain memcpy? – eerorika Nov 10 '20 at 00:56
  • @tadman I understood that using `temp` can benefit in two parameters - first is the constexpr size that is preferable, second is hoping that `temp` will be cached. – Amir Ofir Nov 10 '20 at 16:50
  • It's used once and then trashed. Why would you want that cached, and what purpose would that serve? Not sure how `constexpr` factors in here either. This function cannot be one, there's no return value and it manipulates arguments. If you copy directly to the target then that target's memory should be "warm" from a caching perspective. Here you copy twice, effectively halving the effectiveness of any caching since you clutter it up with two copies. – tadman Nov 10 '20 at 21:28

1 Answer


I'm skeptical of the answer you linked, which encourages avoiding a function call to memcpy. The implementation of memcpy is surely very well optimized, probably hand-written in assembly, and therefore hard to beat! Moreover, for large copies the function call overhead is negligible compared to memory access latency. So simply calling memcpy is likely the fastest way to copy contiguous bytes around in memory.

If output_jump were zero, a single call to memcpy can copy input directly to output (and this would be hard to beat). For nonzero output_jump, the copy needs to be divided up over the contiguous vectors. Use one memcpy per vector, without the temp buffer, copying directly from input + i * vector_size to output + i * (vector_size + output_jump).
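As a minimal sketch of that suggestion (keeping the question's signature, dropping the temp buffer and the 2048-float chunking):

```cpp
#include <cstring>

// One memcpy per vector, copying straight from source to destination.
// Each output vector starts output_jump floats after the previous one ends.
void copyToTarget(const float *input, float *output,
                  int vector_count, int vector_size, int output_jump)
{
    for (int i = 0; i < vector_count; ++i) {
        std::memcpy(output + static_cast<std::size_t>(i) * (vector_size + output_jump),
                    input + static_cast<std::size_t>(i) * vector_size,
                    static_cast<std::size_t>(vector_size) * sizeof(float));
    }
}
```

The cast to `std::size_t` guards against overflow of the byte offset for large counts; otherwise this is just the single-memcpy-per-vector loop described above.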

But better yet, as the top answer on that thread suggests, try if possible to find a way to avoid copying the data in the first place.

Pascal Getreuer