I have written two functions that compute the sum of an array: the first is written in C++ and the other uses inline assembly (x86-64). I compared the performance of the two functions on my machine.
If no -O flag is passed during compilation, the inline assembly function is roughly 3x faster than the C++ version.
cpp time : 543070068 nanoseconds
cpp time : 547990578 nanoseconds
asm time : 185495494 nanoseconds
asm time : 188597476 nanoseconds
With -O1, both functions perform about the same.
cpp time : 177510914 nanoseconds
cpp time : 178084988 nanoseconds
asm time : 179036546 nanoseconds
asm time : 181641378 nanoseconds
But with -O2 or -O3, the inline assembly function reports an unusual 2-3 digit nanosecond timing, which looks suspiciously fast to me. (Please bear with me: I have little experience with assembly programming, so I don't know how fast or slow it can be compared to the same program written in C++.)
cpp time : 177522894 nanoseconds
cpp time : 183816275 nanoseconds
asm time : 125 nanoseconds
asm time : 75 nanoseconds
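As a rough sanity check on why these readings look suspicious to me, the sketch below converts a timing into the memory bandwidth it would imply. The helper name implied_gb_per_s is made up for illustration; the element count and timings are the ones from this post.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: implied read bandwidth in GB/s (1 GB = 1e9 bytes)
// when `elements` 64-bit integers are summed in `nanoseconds` ns.
// Bytes per nanosecond is numerically the same as GB per second.
double implied_gb_per_s(std::size_t elements, double nanoseconds) {
    const double bytes = static_cast<double>(elements) * sizeof(std::uint64_t);
    return bytes / nanoseconds;
}
```

Calling implied_gb_per_s(99999999, 177522894.0) gives about 4.5 GB/s, which seems plausible for reading an 800 MB array from RAM. Calling implied_gb_per_s(99999999, 125.0) gives about 6.4 million GB/s, far beyond what I would expect any memory system to deliver, so the array is almost certainly not being summed inside the timed region.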
My Questions
Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?
Is this a normal reading, or is there something wrong with how I time and measure the performance?
Or maybe there is something wrong with my inline assembly function?
And if the inline assembly function is correct and the performance reading is correct, why did the C++ compiler fail to optimize a simple array sum in the C++ version and make it as fast as the inline assembly version?
I have also speculated that memory alignment or cache behavior might be improved at the higher optimization levels, but my knowledge of this is still very limited.
Apart from answering my questions, if you have something to add, please feel free to do so. I hope somebody can explain. Thanks!
[EDIT]
I have removed the macro, run the two versions in isolation, and added the volatile keyword, a "memory" clobber, and a "+&r" constraint for the output. With those changes the performance is now the same as sum_cpp.
However, if I remove the volatile keyword and the "memory" clobber again, I still get those 2-3 digit nanosecond timings.
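For reference, here is a sketch of the variant that shows the 2-3 digit nanosecond timings: the same loop with volatile and the "memory" clobber removed. The name sum_asm_no_volatile is hypothetical, and I am assuming the "+&r" output constraint stays in place. It targets x86-64 with GCC or Clang extended asm.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical name; same loop as sum_asm, minus `volatile` and "memory".
// Assumes x86-64, GCC/Clang extended asm, and length > 0.
// Without the "memory" clobber the compiler is not told that the asm
// reads the array, and without `volatile` the asm is treated as depending
// only on its listed operands.
std::uint64_t sum_asm_no_volatile(const std::uint64_t *numbers, std::size_t length) {
    std::uint64_t sum = 0;
    asm(
        "xorq %%rax, %%rax\n\t"                    // index = 0
        "%=:\n\t"                                  // unique local loop label
        "addq (%[numbers], %%rax, 8), %[sum]\n\t"  // sum += numbers[index]
        "incq %%rax\n\t"                           // ++index
        "cmpq %%rax, %[length]\n\t"
        "jne %=b"                                  // loop while index != length
        : [sum] "+&r"(sum)
        : [numbers] "r"(numbers), [length] "r"(length)
        : "%rax", "cc"
    );
    return sum;
}
```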
code:
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>

// Plain C++ reference implementation.
uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    for (size_t i = 0; i < length; ++i) {
        sum += numbers[i];
    }
    return sum;
}

// Inline assembly version (x86-64, AT&T syntax). Assumes length > 0.
uint64_t sum_asm(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    asm volatile(
        "xorq %%rax, %%rax\n\t"                    // index = 0
        "%=:\n\t"                                  // unique local loop label
        "addq (%[numbers], %%rax, 8), %[sum]\n\t"  // sum += numbers[index]
        "incq %%rax\n\t"                           // ++index
        "cmpq %%rax, %[length]\n\t"
        "jne %=b"                                  // loop while index != length
        : [sum] "+&r"(sum)
        : [numbers] "r"(numbers), [length] "r"(length)
        : "%rax", "memory", "cc"
    );
    return sum;
}

int main() {
    std::mt19937_64 rand_engine(1);
    std::uniform_int_distribution<uint64_t> random_number(0, 5000);

    size_t length = 99999999;
    uint64_t *arr = new uint64_t[length];
    // Start at 0 so that arr[0] is initialized as well.
    for (size_t i = 0; i < length; ++i) arr[i] = random_number(rand_engine);

    uint64_t cpp_total = 0, asm_total = 0;

    // Time five runs of whichever version was selected at compile time.
    for (size_t i = 0; i < 5; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
        cpp_total += sum_cpp(arr, length);
#else
        asm_total += sum_asm(arr, length);
#endif
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
        std::cout << "time : " << dur.count() << " nanoseconds\n";
    }

#ifndef _INLINE_ASM
    std::cout << "cpp sum = " << cpp_total << "\n";
#else
    std::cout << "asm sum = " << asm_total << "\n";
#endif

    delete[] arr;
    return 0;
}