I have been looking at fast ways to copy various amounts of data, when NEON vector instructions are available on an ARM device.
I've done some benchmarks, and have some interesting results. I'm trying to understand what I'm looking at.
I have got four versions to copy data:
1. Baseline
Copies element by element:
for (int i = 0; i < size; ++i)
{
copy[i] = orig[i];
}
2. NEON
This code loads four values into a temporary register, then copies the register to output.
Thus the number of loads are reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found a way.
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
vst1q_s32(©2[i], tmp); // copy 4 elements from tmp SIMD register
}
3. Stepped memcpy
,
Uses the memcpy
, but copies 4 elements at a time. This is to compare against the NEON version.
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
4. Normal memcpy
Uses memcpy
with full amount of data.
memcpy(orig, copy4, size);
My benchmark using 2^16
values gave some surprising results:
1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]
The speedup for NEON time is expected, however the faster stepped memcpy
time is surprising to me. And the time for 4
even more so.
Why is memcpy
doing so well? Does it use NEON under-the-hood? Or are there efficient memory copy instructions I am not aware of?
This question discussed NEON versus memcpy()
. However I don't feel the answers explore sufficently why the ARM memcpy
implementation runs so well
The full code listing is below:
#include <arm_neon.h>
#include <vector>
#include <cinttypes>
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>
int main(int argc, char *argv[]) {
int arr_size;
if (argc==1)
{
std::cout << "Please enter an array size" << std::endl;
exit(1);
}
int size = atoi(argv[1]); // not very C++, sorry
std::int32_t* orig = new std::int32_t[size];
std::int32_t* copy = new std::int32_t[size];
std::int32_t* copy2 = new std::int32_t[size];
std::int32_t* copy3 = new std::int32_t[size];
std::int32_t* copy4 = new std::int32_t[size];
// Non-neon version
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; ++i)
{
copy[i] = orig[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// NEON version
begin = std::chrono::steady_clock::now();
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
vst1q_s32(©2[i], tmp); // copy 4 elements from tmp SIMD register
}
end = std::chrono::steady_clock::now();
std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
memcpy(orig, copy4, size);
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
return 0;
}