Why is valarray so slow?

Question

I am trying to use valarray since it is much like MATLAB while operating vector and matrices. I first did some performance check and found that valarray cannot achieve the performance declared as in the book C++ programming language by Stroustrup.

The test program actually did 5 million multiplication of doubles. I thought that c = a*b would at least be comparable to the for loop double type element multiplication, but I am totally wrong. I tried on several computers and Microsoft Visual C++ 6.0 and Visual Studio 2008.

By the way, I tested on MATLAB using the following code:

len = 5*1024*1024;
a = rand(len, 1);
b = rand(len, 1);
c = zeros(len, 1);
tic;
c = a.*b;
toc;

And the result is 46 ms. This time is not high precision; it only works as a reference.

The code is:

#include <iostream>
#include <valarray>
#include <iostream>
#include "windows.h"

using namespace std;
SYSTEMTIME stime;
LARGE_INTEGER sys_freq;

double gettime_hp();

int main()
{
    enum { N = 5*1024*1024 };
    valarray<double> a(N), b(N), c(N);
    QueryPerformanceFrequency(&sys_freq);
    int i, j;
    for (j=0 ; j<8 ; ++j)
    {
        for (i=0 ; i<N ; ++i)
        {
            a[i] = rand();
            b[i] = rand();
        }

        double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0];
        double dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c1[i] = a1[i] * b1[i];
        dtime = gettime_hp()-dtime;
        cout << "double operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        c = a*b ;
        dtime = gettime_hp() - dtime;
        cout << "valarray operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c[i] = a[i] * b[i];
        dtime = gettime_hp() - dtime;
        cout << "valarray[i] operator* " << dtime<< " ms\n";

        cout << "------------------------------------------------------\n";
    }
}

double gettime_hp()
{
    LARGE_INTEGER tick;
    extern LARGE_INTEGER sys_freq;
    QueryPerformanceCounter(&tick);
    return (double)tick.QuadPart * 1000.0 / sys_freq.QuadPart;
}

The running results: (release mode with maximal speed optimization)

double operator* 52.3019 ms
valarray operator* 128.338 ms
valarray[i] operator* 43.1801 ms
------------------------------------------------------
double operator* 43.4036 ms
valarray operator* 145.533 ms
valarray[i] operator* 44.9121 ms
------------------------------------------------------
double operator* 43.2619 ms
valarray operator* 158.681 ms
valarray[i] operator* 43.4871 ms
------------------------------------------------------
double operator* 42.7317 ms
valarray operator* 173.164 ms
valarray[i] operator* 80.1004 ms
------------------------------------------------------
double operator* 43.2236 ms
valarray operator* 158.004 ms
valarray[i] operator* 44.3813 ms
------------------------------------------------------

Debugging mode with same optimization:

double operator* 41.8123 ms
valarray operator* 201.484 ms
valarray[i] operator* 41.5452 ms
------------------------------------------------------
double operator* 40.2238 ms
valarray operator* 215.351 ms
valarray[i] operator* 40.2076 ms
------------------------------------------------------
double operator* 40.5859 ms
valarray operator* 232.007 ms
valarray[i] operator* 40.8803 ms
------------------------------------------------------
double operator* 40.9734 ms
valarray operator* 234.325 ms
valarray[i] operator* 40.9711 ms
------------------------------------------------------
double operator* 41.1977 ms
valarray operator* 234.409 ms
valarray[i] operator* 41.1429 ms
------------------------------------------------------
double operator* 39.7754 ms
valarray operator* 234.26 ms
valarray[i] operator* 39.6338 ms
------------------------------------------------------

Did you run the executable ? or did you try it in a debugger (through visual studio) ? — Yochai Timmer, Jul 27 '11 at 20:34
@alan:setting is: /nologo /ML /W3 /GX /O2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/compvarray.pch" /YX /Fo"Release/" also compiled on vs2008, running the same — shangping, Jul 27 '11 at 20:39
Results on MSVS2010 differ, but trends are the same: 17.5872, 51.9515, 25.5569. — eugene_che, Jul 27 '11 at 20:44
On GCC 4.6.1 with `-flto -O3 -march=native -std=c++0x`, I get nearly identical performance for all three cases, with a tiny increase from first to third. — Kerrek SB, Jul 27 '11 at 20:56
[Why is valarray so slow on Visual Studio 2015?](https://stackoverflow.com/q/56050322/995714) — phuclv, Dec 18 '19 at 05:41

Paul R · Answer 1 · 2011-07-28T06:54:20.990

I just tried it on a Linux x86-64 system (Sandy Bridge CPU):

gcc 4.5.0:

double operator* 9.64185 ms
valarray operator* 9.36987 ms
valarray[i] operator* 9.35815 ms

Intel ICC 12.0.2:

double operator* 7.76757 ms
valarray operator* 9.60208 ms
valarray[i] operator* 7.51409 ms

In both cases I just used -O3 and no other optimisation-related flags.

It looks like the MS C++ compiler and/or valarray implementation suck.

Here's the OP's code modified for Linux:

#include <iostream>
#include <valarray>
#include <iostream>
#include <ctime>

using namespace std ;

double gettime_hp();

int main()
{
    enum { N = 5*1024*1024 };
    valarray<double> a(N), b(N), c(N) ;
    int i,j;
    for(  j=0 ; j<8 ; ++j )
    {
        for(  i=0 ; i<N ; ++i )
        {
            a[i]=rand();
            b[i]=rand();
        }

        double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0] ;
        double dtime=gettime_hp();
        for(  i=0 ; i<N ; ++i ) c1[i] = a1[i] * b1[i] ;
        dtime=gettime_hp()-dtime;
        cout << "double operator* " << dtime << " ms\n" ;

        dtime=gettime_hp();
        c = a*b ;
        dtime=gettime_hp()-dtime;
        cout << "valarray operator* " << dtime << " ms\n" ;

        dtime=gettime_hp();
        for(  i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
        dtime=gettime_hp()-dtime;
        cout << "valarray[i] operator* " << dtime<< " ms\n" ;

        cout << "------------------------------------------------------\n" ;
    }
}

double gettime_hp()
{
    struct timespec timestamp;

    clock_gettime(CLOCK_REALTIME, &timestamp);
    return timestamp.tv_sec * 1000.0 + timestamp.tv_nsec * 1.0e-6;
}

Nice - just for reference, could you add the options used in the build (I may play around with this stuff tonight...) — Michael Burr, Jul 27 '11 at 22:27
+1. I ran this benchmark on the libc++ implementation. It wasn't as slow as MS, but wasn't as fast as gcc (it was about the same speed as you report for ICC). Turns out I was missing a key assignment operator in the expression template engine. Added that. Now libc++ is as fast as gcc. To the OP: Thanks for the speed test! (+1 on the question too) :-) — Howard Hinnant, Jul 27 '11 at 23:24
Thanks both - I've added a note re compiler switches (just `-O3` in both cases) and also appended the OP's code modified for Linux that I used for these tests. — Paul R, Jul 28 '11 at 06:55

score 13 · Accepted Answer · answered Jul 27 '11 at 20:46

13

I suspect that the reason c = a*b is so much slower than performing the operations an element at a time is that the

template<class T> valarray<T> operator*
    (const valarray<T>&, const valarray<T>&);

operator must allocate memory to put the result into, then returns that by value.

Even if a "swaptimization" is used to perform the copy, that function still has the overhead of

allocating the new block for the resulting valarray
initializing the new valarray (it's possible that this might be optimized away)
putting the results into the new valarray
paging in the memory for the new valarray as it is initialized or set with result values
deallocating the old valarray that gets replaced by the result

answered Jul 27 '11 at 20:46

Michael Burr

333,147
50
533
760

I just searched, it actually returns a reference: template inline valarray<_Ty>& operator*=(valarray<_Ty>& _L, const _Ty& _R) {_VALGOP2(*= _R); } – shangping Jul 27 '11 at 20:57
2

There are two differences in the declaration you posted in the comment above from what would be used in the code posted in the question: 1) `operator*=` is different than using `operator*()` followed by `operator=()`, and 2) that's the declaration for the `*=` operator that takes a scalar argument to multiply the `valarray` by – Michael Burr Jul 27 '11 at 21:02
Michael, according to the analysis, we have no way to use the valarray if performance is needed. However, according to the book, this class is specially designed for improving performance. Would you give me some points? Any other method to allow me to handle arrays as a whole just as valarray in c++? Thanks – shangping Jul 27 '11 at 21:13
@Michael: you may be right but see my Linux benchamrks in a separate answer - it certainly seems that valarray performance does not need to be significantly inferior to a straight loop, provided you use a decent compiler and valarray implementation. – Paul R Jul 27 '11 at 21:27
@shangping: It was intended, originally, that `valarray` could have special CPU instructions like SSE used. They *can* significantly increase the speed of what's going on. However, the copy time that you're experiencing here way outweighs that advantage- if your implementation even does it. – Puppy Jul 27 '11 at 21:28
@shangping: sorry, I wouldn't be much help with that part of the problem. Kerrick SB posted a comment that says that GCC 4.6.1 doesn't show the same degradation - maybe using MinGW might be a solution for you? Another possibility is to use your own template function for these operations that takes an output parameter by reference. It would probably make your code look uglier than you'd like though. – Michael Burr Jul 27 '11 at 21:31
11

News flash: valarray arithmetic is allowed but not required to use *expression templates* (http://en.wikipedia.org/wiki/Expression_templates). Use of expression templates can completely eliminate temporaries in the OP's problem, and thus completely eliminate heap allocation and deallocation for the expression `c = a*b`. It is evident that gcc does this (and a slightly corrected libc++), and that MS C++ doesn't. – Howard Hinnant Jul 27 '11 at 23:47
@howard, this information is very helpful. I was trying to do this following the instructions on the book, but I cannot get it done correctly. Would you be able to show me the 'delayed evaluation' or expression templates specifically for the c=a*b problem? Thanks – shangping Jul 28 '11 at 05:03
@shangping: Programming expression templates is one of the harder exercises in C++. But the wikipedia article I linked to above is not a bad start. The `VecDifference` example it uses could be trivially modified to be `VecProduct`. – Howard Hinnant Jul 28 '11 at 13:01
Too bad the accepted answer is simply WRONG! All you are talking about is simply NOT TRUE! The allocation will always be done (to store the result) and there won't be any copy as any decent compiler will perform return value optimization. As a matter of fact, with gcc there is no significant difference between the three implementation ... – PierreBdR Nov 01 '13 at 09:24
@PierreBdR but the reason gcc performs well is not because of return value optimization, it's because gcc's `valarray` uses expression templates (see Howard Hinnant's comment above). It's not due to clever compiler optimizations, it's due to a clever Standard Library implementation of `std::valarray`. – Jonathan Wakely Apr 05 '18 at 23:12

score 6 · Answer 3 · edited May 09 '19 at 09:52

The whole point of valarray is to be fast on vector machines, which x86 machines just aren't.

A good implementation on a nonvector machine should be able to match the performance that you get with something like

for (i=0; i < N; ++i) 
    c1[i] = a1[i] * b1[i];

and a bad one of course won't. Unless there is something in the hardware to expedite parallel processing, that is going to be pretty close to the best that you can do.

score 3 · Answer 4 · edited May 09 '19 at 09:51

I finally got this through using delayed evaluation. The code may be ugly since I am just starting learning these C++ advanced concepts.

Here is the code:

#include <iostream>
#include <valarray>
#include <iostream>
#include "windows.h"

using namespace std;
SYSTEMTIME stime;
LARGE_INTEGER sys_freq;

double gettime_hp();

// To improve the c = a*b (it will generate a temporary first, assigned to 'c' and delete the temporary.
// Which causes the program really slow
// The solution is the expression template and let the compiler to decide when all the expression is known.


// Delayed evaluation
//typedef valarray<double> Vector;
class Vector;

class VecMul
{
    public:
        const Vector& va;
        const Vector& vb;
        //Vector& vc;
        VecMul(const Vector& v1, const Vector& v2): va(v1), vb(v2) {}
        operator Vector();
};

class Vector:public valarray<double>
{
    valarray<double> *p;

    public:
        explicit Vector(int n)
        {
            p = new valarray<double>(n);
        }
        Vector& operator = (const VecMul &m)
        {
            for(int i=0; i<m.va.size(); i++)
                (*p)[i] = (m.va)[i]*(m.vb)[i]; // Ambiguous
            return *this;
        }
        double& operator[](int i) const {return (*p)[i];} //const vector_type[i]
        int size()const {return (*p).size();}
};


inline VecMul operator*(const Vector& v1, const Vector& v2)
{
    return VecMul(v1, v2);
}


int main()
{
    enum {N = 5*1024*1024};
    Vector a(N), b(N), c(N);
    QueryPerformanceFrequency(&sys_freq);
    int i, j;
    for (j=0 ; j<8 ; ++j)
    {
        for (i=0 ; i<N ; ++i)
        {
            a[i] = rand();
            b[i] = rand();
        }

        double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0];
        double dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c1[i] = a1[i] * b1[i];
        dtime = gettime_hp()-dtime;
        cout << "double operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        c = a*b;
        dtime = gettime_hp()-dtime;
        cout << "valarray operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c[i] = a[i] * b[i];
        dtime = gettime_hp() - dtime;
        cout << "valarray[i] operator* " << dtime << " ms\n";

        cout << "------------------------------------------------------\n";
    }
}

double gettime_hp()
{
    LARGE_INTEGER tick;
    extern LARGE_INTEGER sys_freq;
    QueryPerformanceCounter(&tick);
    return (double)tick.QuadPart*1000.0/sys_freq.QuadPart;
}

The running result on Visual studio is:

double operator* 41.2031 ms
valarray operator* 43.8407 ms
valarray[i] operator* 42.49 ms

Why does your `class Vector` have _two_ valarray objects? One as a base class, and another allocated on the heap? It seems you only use the one on the heap, but that requires additional allocations. Just use the base class! — Jonathan Wakely, Apr 05 '18 at 23:33

score 1 · Answer 5 · edited May 09 '19 at 09:24

I'm compiling in release x64, Visual Studio 2010. I changed your code very slightly:

    double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0];
    double dtime = gettime_hp();
    for (i=0 ; i<N ; ++i)
        a1[i] *= b1[i];
    dtime = gettime_hp() - dtime;
    cout << "double operator* " << dtime << " ms\n";

    dtime = gettime_hp();
    a *= b;
    dtime = gettime_hp() - dtime;
    cout << "valarray operator* " << dtime << " ms\n";

    dtime = gettime_hp();
    for (i=0 ; i<N ; ++i)
        a[i] *= b[i];
    dtime = gettime_hp() - dtime;
    cout << "valarray[i] operator* " << dtime<< " ms\n";

    cout << "------------------------------------------------------\n" ;

Here you can see that I used *= instead of c = a * b. In more modern mathematical libraries, very complex expression template mechanisms are used that eliminate this problem. In this case, I actually got very slightly faster results from valarray, although that's probably just because the contents were already in a cache. The overhead that you are seeing is simply redundant temporaries and nothing intrinsic to valarray, specifically- you'd see the same behaviour with something like std::string.

I verified your results. This change is not slight change though. Many compounding expression cannot always be done using *=, += /= — shangping, Jul 27 '11 at 22:19
@shangping: In that case, if you allocated a new result array for each of the temporary variables that you needed, you would see a similar slowdown for `double` as for `valarray`. — Puppy, Jul 28 '11 at 08:58
_"In more modern mathematical libraries, very complex expression template mechanisms are used that eliminate this problem."_ And also in high-quality implementations of `std::valarray`. — Jonathan Wakely, Apr 05 '18 at 23:38

score -1 · Answer 6 · edited May 09 '19 at 09:59

-1

I think Michael Burr's reply is right. And maybe you can create a virtual type as the type the return value of operator +, and reload another operator= for this virtual type like operator=(virtual type& v){&valarray=&v;v=NULL;} (roughly speaking).

Of course, it is difficult to implement the idea on valarray. But when you create a new class, you can try this idea. And then, the efficiency for operator+ is almost the same as operator+=.

edited May 09 '19 at 09:59

Peter Mortensen

30,738
21
105
131

answered May 06 '15 at 05:45

Lin Xuelei

111
3

1

This should have been a comment to his answer, not a new answer. – A.L. Nov 29 '17 at 13:08

score -2 · Answer 7 · edited May 09 '19 at 09:58

Hmm..I tested Blitz++ and it's same as valarray... And moreover, the Blitz++ [] operator is very slow.

#include <blitz/array.h>
#include <iostream>

#ifdef WIN32
#include "windows.h"
LARGE_INTEGER sys_freq;
#endif

#ifdef LINUX
<ctime>
#endif

using namespace std;
SYSTEMTIME stime;

__forceinline double gettime_hp();
double gettime_hp()
{
    #ifdef WIN32
        LARGE_INTEGER tick;
        extern LARGE_INTEGER sys_freq;
        QueryPerformanceCounter(&tick);
        return (double)tick.QuadPart * 1000.0 / sys_freq.QuadPart;
    #endif

    #ifdef LINUX
        struct timespec timestamp;

        clock_gettime(CLOCK_REALTIME, &timestamp);
        return timestamp.tv_sec * 1000.0 + timestamp.tv_nsec * 1.0e-6;
    #endif
}
BZ_USING_NAMESPACE(blitz)

int main()
{
    int N = 5*1024*1024;

    // Create three-dimensional arrays of double
    Array<double, 1> a(N), b(N), c(N);

    int i, j;

    #ifdef WIN32
        QueryPerformanceFrequency(&sys_freq);
    #endif

    for (j=0 ; j<8 ; ++j)
    {
        for (i=0 ; i<N ; ++i)
        {
            a[i] = rand();
            b[i] = rand();
        }

        double* a1 = a.data(), *b1 = b.data(), *c1 = c.data();
        double dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c1[i] = a1[i] * b1[i];
        dtime = gettime_hp() - dtime;
        cout << "double operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        c = a*b;
        dtime = gettime_hp() - dtime;
        cout << "blitz operator* " << dtime << " ms\n";

        dtime = gettime_hp();
        for (i=0 ; i<N ; ++i)
            c[i] = a[i] * b[i];
        dtime = gettime_hp() - dtime;
        cout << "blitz[i] operator* " << dtime<< " ms\n";

        cout << "------------------------------------------------------\n";
    }
}

9

...and what was the result? – quant Aug 11 '14 at 04:50

Why is valarray so slow?

7 Answers7

Linked