I'll preface this by saying C++ is not my typical area of work; I'm more often in C# and Matlab. I also don't pretend to be able to read x86 assembly code. Having recently seen some videos on "modern C++" and the new instructions on the latest processors, though, I figured I'd poke around a bit more and see what I can learn. I do have some existing C++ DLLs that would benefit from speed improvements; those DLLs use many trig and power operations from <cmath>.
So I whipped up a simple benchmark program in VS2013 Express for Desktop. The processor in my machine is an Intel i7-4800MQ (Haswell). The program is pretty simple: it allocates some std::vector<double>s with 5 million random entries each, then loops over them doing some math operation that combines the values. I measure the time spent by calling std::chrono::high_resolution_clock::now() immediately before and after the loop:
[Edit: Including full program code]
#include "stdafx.h"
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int _tmain(int argc, _TCHAR* argv[])
{
    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<float> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Reserve capacity up front so push_back never reallocates
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        double result_value;
        result_value = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
        y1.push_back(result_value);
    }

    // Stop timer and report elapsed time
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}
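A single timed pass can be noisy, so one way to firm up the numbers is to repeat the loop and keep the fastest run. The following is only a minimal sketch under that assumption: `best_time_ms` and `trig_kernel` are hypothetical names that are not part of the program above, and `std::chrono::steady_clock` is swapped in for `high_resolution_clock` because it is specified to be monotonic.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original program): runs `kernel`
// `repeats` times and returns the fastest wall-clock time in milliseconds.
template <typename Kernel>
double best_time_ms(Kernel kernel, int repeats)
{
    assert(repeats > 0);
    double best = 0.0;
    for (int r = 0; r < repeats; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        kernel();
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (r == 0 || ms < best)
        {
            best = ms;
        }
    }
    return best;
}

// Hypothetical wrapper around the same math as the timed loop, writing
// into an output vector that the caller has already sized.
void trig_kernel(const std::vector<double>& x1,
                 const std::vector<double>& x2,
                 const std::vector<double>& x3,
                 std::vector<double>& y1)
{
    assert(x1.size() == x2.size() && x2.size() == x3.size() && x3.size() == y1.size());
    for (std::size_t i = 0; i < x1.size(); ++i)
    {
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }
}
```

A call such as `best_time_ms([&] { trig_kernel(x1, x2, x3, y1); }, 5)` would then report the best of five passes, with `y1` resized to `n_points` beforehand.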
I put VS into Release configuration with standard options (e.g. /O2). I do one build with /arch:IA32 and run it a few times, and another with /arch:AVX and run it a few times. Consistently, the AVX build is ~3.6x slower than the IA32 build: in this specific example, 773 ms compared to 216 ms.
As a sanity check I tried some other very basic operations (a combination of mults and adds, taking some number to the 8th power), and there AVX is at least as fast as IA32, if not a bit faster. So why might my code above be impacted so much? Or where might I look to find out?
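The exact sanity-check loops are not shown here, so the following is only a guess at their shape: a multiply-add kernel and an 8th-power kernel that use plain arithmetic and never call into <cmath>. The names `mul_add_kernel`, `pow8`, and `pow8_kernel` are hypothetical, and computing the 8th power by repeated squaring is an assumption (the original may have used std::pow).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed form of the "mults and adds" check: y[i] = a[i] * b[i] + c[i].
void mul_add_kernel(const std::vector<double>& a,
                    const std::vector<double>& b,
                    const std::vector<double>& c,
                    std::vector<double>& y)
{
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == y.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        y[i] = a[i] * b[i] + c[i];
    }
}

// x^8 computed with three squarings, so no math library call is involved.
double pow8(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    return x4 * x4;
}

// Assumed form of the "8th power" check: y[i] = x[i]^8.
void pow8_kernel(const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        y[i] = pow8(x[i]);
    }
}
```

Kernels like these contain only inline arithmetic, whereas the slow loop spends its time inside the std::sin and std::atan library routines, which is one difference worth keeping in mind when comparing the two cases.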
Edit 2: At the suggestion of someone on Reddit, I changed the code into something more vectorizable, which makes both SSE2 and AVX run faster, but AVX is still much slower than SSE2:
#include "stdafx.h"
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int _tmain(int argc, _TCHAR* argv[])
{
    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<double> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Reserve capacity up front so push_back never reallocates
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs and presize the output so the timed loop only writes by index
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
        y1.push_back(0.0);
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }

    // Stop timer and report elapsed time
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}
IA32: 209 ms
SSE: 205 ms
SSE2: 75 ms
AVX: 371 ms
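Since y1 is never read after the timed loop, one optional bit of benchmark hygiene is to print a checksum of the results. That gives the optimizer a reason to keep the work and makes it easy to confirm that builds with different /arch settings produce (nearly) the same values. This is a minimal sketch; `checksum` is a hypothetical helper that is not part of the program above.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: sums every element of the output vector so the
// results are observably used and can be compared between builds.
double checksum(const std::vector<double>& values)
{
    // An empty vector would mean the math loop never produced any output.
    assert(!values.empty());
    return std::accumulate(values.begin(), values.end(), 0.0);
}
```

Printing `checksum(y1)` next to the duration is enough. Small differences in the last digits between builds would not be surprising, because different instruction sets may round intermediate results differently.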
As for the specific version of Visual Studio, this is 2013 Express for Desktop Update 1 (Version 12.0.30110.00 Update 1).