
I am comparing the speed of a Monte Carlo pricing algorithm for a vanilla call option between Matlab and C++. This is not the same question as "Why is MATLAB so fast in matrix multiplication?", since the speed-up here is not due to matrix multiplication (there is only a dot product, which is done quickly) but seems to come from Matlab's highly efficient Gaussian random number generator.

In Matlab the code has been vectorised and looks as follows

function [ value ] = OptionMCValue( yearsToExpiry, spot, strike, riskFreeRate, dividendYield, volatility, numPaths  )

    sd = volatility*sqrt(yearsToExpiry);
    sAdjusted = spot * exp( (riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);

    g = randn(1,numPaths);
    sT = sAdjusted * exp( g * sd );
    values = max(sT-strike,0);
    value = mean(values);
    value = value * exp(-riskFreeRate * yearsToExpiry);

end

If I run this with 10 million paths as follows

strike = 100.0;
yearsToExpiry = 2.16563;
spot = 100.0;
volatility = 0.20;
dividendYield = 0.03;
riskFreeRate = 0.05;
oneMillion = 1000000;
numPaths = 10*oneMillion;

tic
value = OptionMCValue( yearsToExpiry, spot, strike, riskFreeRate, dividendYield, volatility, numPaths  );
toc

I get

Elapsed time is 0.359304 seconds.
   12.8311

Now I do the same thing in C++ in VS2013.

My code is in an OptionMC class (the strike is stored as the member m_strike) and is as follows

double OptionMC::value(double yearsToExpiry, 
                   double spot,
                   double riskFreeRate,
                   double dividendYield,
                   double volatility, 
                   unsigned long numPaths )
{
    double sd = volatility*sqrt(yearsToExpiry);
    double sAdjusted = spot * exp( (riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);
    double value = 0.0;
    double g, sT;

    for (unsigned long i = 0; i < numPaths; i++)
    {
        g = GaussianRVByBoxMuller();
        sT = sAdjusted * exp(g * sd);
        value += Max(sT - m_strike, 0.0);
    }

    value = value * exp(-riskFreeRate * yearsToExpiry);
    value /= (double) numPaths;
    return value;
}

The Box-Muller (BM) code is as follows

double GaussianRVByBoxMuller()
{
    double result;
    double x;
    double y;
    double w;

    do
    {
        x = 2.0*rand() / static_cast<double>(RAND_MAX) - 1;
        y = 2.0*rand() / static_cast<double>(RAND_MAX) - 1;
        w = x*x + y*y;
    } while (w >= 1.0);

    w = sqrt(-2.0 * log(w) / w);
    result = x*w;

    return result;
}

I have set the Optimization option to Optimize for speed in Visual Studio.

For 10m paths it takes 4.124 seconds.

This is 11 times slower than Matlab.

Can anyone explain the difference?

EDIT: On further testing, the slowdown does seem to come from the call to GaussianRVByBoxMuller. Matlab appears to have a very efficient implementation (the Ziggurat method). Note that Box-Muller is sub-optimal here, as it generates two random variates and I only use one; fixing just that would give a factor-of-two speed-up.
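
For reference, a minimal sketch of that caching fix might look like the following. It is single-threaded only (it keeps static state), and GaussianRVByBoxMullerCached is just an illustrative name, not something from my actual class:

#include <cmath>
#include <cstdlib>

// Sketch: polar Box-Muller that caches the second deviate instead of discarding it,
// roughly halving the number of rejection loops per returned value.
double GaussianRVByBoxMullerCached()
{
    static bool haveSpare = false;
    static double spare = 0.0;

    if (haveSpare)
    {
        haveSpare = false;
        return spare;
    }

    double x, y, w;
    do
    {
        x = 2.0*rand() / static_cast<double>(RAND_MAX) - 1;
        y = 2.0*rand() / static_cast<double>(RAND_MAX) - 1;
        w = x*x + y*y;
    } while (w >= 1.0 || w == 0.0); // also reject w == 0 to avoid log(0)

    w = sqrt(-2.0 * log(w) / w);
    spare = y*w;       // keep the second deviate for the next call
    haveSpare = true;
    return x*w;        // return the first deviate now
}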

Dom
  • Is the C++ version vectorized like the MATLAB version? – NathanOliver Mar 21 '16 at 12:12
  • The optimizations used in Matlab are not the same as the ones the C++ compiler applies to your code. – Some programmer dude Mar 21 '16 at 12:14
  • Check if you're using the .net framework. Not sure if that would affect it. – xvan Mar 21 '16 at 12:14
  • It is possible that Matlab is internally multithreading the various function calls... just a guess though. – Emerald Weapon Mar 21 '16 at 12:28
  • Ah, the good old times when it was sufficient to simply re-write in C++ and you had a guaranteed speed-up! Nowadays, Matlab is a lot more optimized, and you need to start writing very good C++ code and use fast libraries to robustly see a speed increase. – Jonas Mar 21 '16 at 12:33
  • @Jonas Your comment is very interesting. Why do you say that? Is Matlab's automatic conversion of .m code to machine-readable code (like Java or C) really so efficient? – Adiel Mar 21 '16 at 12:42
  • Are you sure you are not running any other C++ code? I just tried wrapping your C++ function in a mex function and using `g=1` it is around 25 times faster! If I use `std::normal_distribution` for random number generation it is on par. – Jørgen Mar 21 '16 at 13:39
  • `GaussianRVByBoxMuller` is always created and destroyed in the loop. Since all the code is just double computation, I suppose this piece of code (constructor and destructor) is hiding something! – norisknofun Mar 21 '16 at 13:42
  • Just to make sure - is it possible that you are using debug mode compilation instead of release? – ibezito Mar 21 '16 at 14:23
  • Be sure to read the second-top-voted answer on the marked duplicate. – Andras Deak -- Слава Україні Mar 21 '16 at 19:05
  • @drorco I think so. I switched the optimization flag to maximize speed. – Dom Mar 21 '16 at 20:45
  • @jorgen I tried using std::normal_distribution but it had only a tiny effect. – Dom Mar 21 '16 at 21:11
  • @Dominic Is your C++ code reduced to only the above function? – Jørgen Mar 21 '16 at 21:20
  • Not exactly sure what you mean. I create a VanillaOption object and then call the value function. I then time it by looking at the system clock before and after. Nothing else happens. – Dom Mar 21 '16 at 21:37
  • All I meant was that it looks like your code was part of some larger class. In the end, it was probably as suggested: Matlab takes advantage of parallel execution. – Jørgen Mar 23 '16 at 08:33

1 Answer


As it stands right now, you're generating single-threaded code. At a guess, Matlab is using multi-threaded code. That allows it to run faster by a factor of approximately N, where N is the number of cores in your CPU.

There's a little more to the story than that, though. One more problem is that you're using rand(), which relies on hidden, global state. As such, if you do a simple rewrite of your code to make it multi-threaded, chances are pretty good that contention over rand()'s internal state will prevent you from getting much speed improvement (and it might easily run slower, perhaps quite a bit slower).

To cure that, you might consider (for example) using the new random number generation (and possibly distribution) classes added in C++11. With these, you can create a separate instance of the random number generator for each thread, preventing conflict over their internal state.

I rewrote your code a little to use those, and added a main to call the function, to get this:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>

double m_strike = 100.0;

class generator {
    std::mt19937_64 gen;
    std::normal_distribution<double> dis;
public:
    // parameters are the mean and standard deviation of the normal distribution
    generator(double mean = 0.0, double stddev = 1.0)
        : gen(std::random_device()()), dis(mean, stddev) {}

    double operator()() {
        return dis(gen);
    }
};

double value(double yearsToExpiry,
    double spot,
    double riskFreeRate,
    double dividendYield,
    double volatility,
    unsigned long numPaths)
{
    double sd = volatility*sqrt(yearsToExpiry);
    double sAdjusted = spot * exp((riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);
    double value = 0.0;

    generator gen;

// run iterations in parallel, with a private random number generator for each thread;
// g and sT are declared inside the loop so each thread has its own copies
#pragma omp parallel for reduction(+:value) private(gen)
    for (long i = 0; i < (long)numPaths; i++)
    {
        double g = gen(); // GaussianRVByBoxMuller();
        double sT = sAdjusted * exp(g * sd);
        value += std::max(sT - m_strike, 0.0);
    }

    value = value * exp(-riskFreeRate * yearsToExpiry);
    value /= (double)numPaths;
    return value;
}

int main() {
    std::cout << "value: " << value(2.16563, 100.0, 0.05, 0.03, 0.2, 10'000'000) << "\n";
}

I compiled this with VC++ 2015, using the following command line:

cl -openmp -Qpar -arch:AVX -O2b2 -GL test.cpp

On an AMD A8-7600 this ran in ~0.31 seconds.
On an Intel i7 processor, it ran in ~0.16 seconds.

Of course, if you have a CPU with a lot more cores, you stand a decent chance of this running quite a bit faster still.

As it stands right now, my code requires VC++ 2015 instead of 2013, but I doubt that it affects performance much. It's mostly just convenience, like using 10'000'000 instead of 10000000 (but I'm not going to install a copy of 2013 on this machine just to figure out exactly what I need to change to accommodate it).

Also note that on a recent Intel processor, you might (or might not) gain some improvement by changing arch:AVX to arch:AVX2.

A quick check of single-threaded code indicates that your Box-Muller distribution code may be a tad faster than the standard library's normal distribution code, so switching to a thread-friendly version of that might gain a little more speed (and the optimized version should approximately double that as well).
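
Purely as a sketch (I haven't benchmarked this variant), such a thread-friendly Box-Muller generator could own its engine and cache the spare deviate, so each thread can hold a private instance exactly like the generator class above:

#include <cmath>
#include <random>

// Sketch only: per-instance engine plus a cached spare deviate, so one object per
// thread (e.g. via OpenMP's private clause) avoids any shared hidden state.
class BoxMullerGenerator {
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> uni{ -1.0, 1.0 };
    double spare = 0.0;
    bool haveSpare = false;
public:
    BoxMullerGenerator() : eng(std::random_device{}()) {}

    double operator()() {
        if (haveSpare) { haveSpare = false; return spare; }
        double x, y, w;
        do {
            x = uni(eng);
            y = uni(eng);
            w = x*x + y*y;
        } while (w >= 1.0 || w == 0.0);
        w = std::sqrt(-2.0 * std::log(w) / w);
        spare = y*w;      // cache the second deviate
        haveSpare = true;
        return x*w;       // return the first deviate now
    }
};

An instance of this should drop in where the generator class is used in the OpenMP loop above, but I'd measure before assuming it beats std::normal_distribution by much.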

Jerry Coffin
  • Thanks! I made the change to the multi-threaded std version and got the time down from 4.124 seconds to 1.44 seconds, which is a good factor-of-3 improvement; my Intel i5 processor has four cores. Changing the architecture had only a tiny impact, as did the other command line options. I think some of the rest of the performance difference must be due to Matlab's Ziggurat Gaussian RV generator, but the remaining factor of four still seems too high to me. – Dom Mar 22 '16 at 07:05