
To speed up the calculations in my library, I decided to use the std::valarray class. The documentation says:

std::valarray and helper classes are defined to be free of certain forms of aliasing, thus allowing operations on these classes to be optimized similar to the effect of the keyword restrict in the C programming language. In addition, functions and operators that take valarray arguments are allowed to return proxy objects to make it possible for the compiler to optimize an expression such as v1 = a * v2 + v3; as a single loop that executes v1[i] = a * v2[i] + v3[i]; avoiding any temporaries or multiple passes.

This is exactly what I need. And it works as described in the documentation when I use the g++ compiler. I have developed a simple example to test the std::valarray performance:

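#include <chrono>
#include <cstddef>
#include <iostream>
#include <valarray>

// Reports an error for every element that differs from the expected 1 + 3 * 2 = 7.
void check(const std::valarray<float>& a)
{
   for (std::size_t i = 0; i < a.size(); i++)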
      if (a[i] != 7)
         std::cout << "Error" << std::endl;
}

int main()
{
   const int N = 100000000;
   std::valarray<float> a(1, N);
   std::valarray<float> c(2, N);
   std::valarray<float> b(3, N);
   std::valarray<float> d(N);

   auto start = std::chrono::system_clock::now();
   d = a + b * c;
   auto end = std::chrono::system_clock::now();

   std::cout << "Valarr optimized case: "
      << (end - start).count() << std::endl;

   check(d);

   // Optimal single loop case
   start = std::chrono::system_clock::now();
   for (int i = 0; i < N; i++)
      d[i] = a[i] + b[i] * c[i];
   end = std::chrono::system_clock::now();
   std::cout << "Optimal case: " << (end - start).count() << std::endl;

   check(d);
   return 0;
}

On g++ I got:

Valarr optimized case: 1484215
Optimal case: 1472202

It seems that the whole expression d = a + b * c; really is evaluated in a single loop, which simplifies the code while maintaining performance. However, this does not work when I use Visual Studio 2015. For the same code, I get:

Valarr optimized case: 6652402
Optimal case: 1766699

The difference is almost a factor of four; there is no optimization! Why is std::valarray not working as needed on Visual Studio 2015? Am I doing everything right? How can I solve the problem without abandoning std::valarray?

Peter Mortensen
dilbert

1 Answer


Am I doing everything right?

You're doing everything right. The problem is in the Visual Studio std::valarray implementation.

Why is std::valarray not working as needed on Visual Studio 2015?

Just open the implementation of any valarray operator, for example operator+. You will see something like (after macro expansion):

   template<class _Ty> inline
      valarray<_Ty> operator+(const valarray<_Ty>& _Left,
         const valarray<_Ty>& _Right)
   {
      // A full temporary array is allocated and filled on every call
      valarray<_Ty> _Ans(_Left.size());
      for (size_t _Idx = 0; _Idx < _Ans.size(); ++_Idx)
         _Ans[_Idx] = _Left[_Idx] + _Right[_Idx];
      return (_Ans);
   }

As you can see, a new object is created and the result of the operation is written into it, so every operator in an expression allocates and fills its own temporary array. There really is no optimization. I do not know why, but it is a fact. It looks like std::valarray was added to Visual Studio for compatibility only.
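To make the cost concrete, here is a rough hand-expanded sketch of what d = a + b * c amounts to when every operator returns a complete std::valarray by value. The function name eager_fma is hypothetical and this is not the library's actual code; it only mirrors the structure of the operator shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: spells out d = a + b * c for an implementation
// whose operators each return a full std::valarray by value.
std::valarray<float> eager_fma(const std::valarray<float>& a,
                               const std::valarray<float>& b,
                               const std::valarray<float>& c)
{
   assert(a.size() == b.size() && b.size() == c.size());

   // First pass: allocate a temporary and fill it with b * c.
   std::valarray<float> product(b.size());
   for (std::size_t i = 0; i < product.size(); ++i)
      product[i] = b[i] * c[i];

   // Second pass: allocate another temporary and fill it with a + product.
   std::valarray<float> sum(a.size());
   for (std::size_t i = 0; i < sum.size(); ++i)
      sum[i] = a[i] + product[i];

   // The caller's assignment then moves or copies this result into d.
   return sum;
}
```

Compared with the hand-written loop, that is two extra allocations and at least two passes over memory instead of one, which is consistent with the slowdown measured in the question.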

For comparison, consider the GNU implementation. Each operator returns an instance of the template class _Expr, which holds only a description of the operation and no result data. The actual computation happens in the assignment operator, more specifically in the __valarray_copy function. Until you assign, everything operates on the proxy object _Expr; only when operator= is called is the operation stored in _Expr evaluated, in a single loop. This is why you get such good results with g++.
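The proxy idea can be illustrated with a stripped-down expression-template sketch. Everything in it (the lazy namespace, Vec, Sum, Product) is a hypothetical name chosen for illustration; it is not the GNU code, which is considerably more general, but it follows the same principle: operators build lightweight nodes, and the only loop lives in the assignment operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace lazy {  // hypothetical namespace, keeps the generic operators contained

// Lazy node for lhs + rhs: stores references, computes one element on demand.
template <class L, class R>
struct Sum {
   const L& lhs;
   const R& rhs;
   float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
   std::size_t size() const { return lhs.size(); }
};

// Lazy node for lhs * rhs.
template <class L, class R>
struct Product {
   const L& lhs;
   const R& rhs;
   float operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
   std::size_t size() const { return lhs.size(); }
};

class Vec {
public:
   Vec(float value, std::size_t n) : data_(n, value) {}
   float operator[](std::size_t i) const { return data_[i]; }
   float& operator[](std::size_t i) { return data_[i]; }
   std::size_t size() const { return data_.size(); }

   // The only loop: the whole expression tree is evaluated element by element.
   template <class E>
   Vec& operator=(const E& expr) {
      assert(expr.size() == size());
      for (std::size_t i = 0; i < size(); ++i)
         data_[i] = expr[i];
      return *this;
   }

private:
   std::vector<float> data_;
};

// The operators allocate nothing; they only record their operands.
template <class L, class R>
Sum<L, R> operator+(const L& l, const R& r) { return Sum<L, R>{l, r}; }

template <class L, class R>
Product<L, R> operator*(const L& l, const R& r) { return Product<L, R>{l, r}; }

}  // namespace lazy
```

With lazy::Vec objects a, b, c and d of equal size, d = a + b * c; builds a Sum node wrapping a Product node and then runs one loop computing d[i] = a[i] + b[i] * c[i]. Because the nodes hold references, this sketch is only safe when the expression is assigned within the same statement; storing it with auto would leave dangling references.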

How can I solve the problem?

You need to find a suitable std::valarray implementation on the internet or you can write your own. You can use the GNU implementation as an example.
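If replacing the implementation is not an option, one possible stopgap is to avoid the binary operators and use valarray's compound assignment operators, which modify their left operand in place. The helper name fma_in_place below is hypothetical, and I have not measured this on Visual Studio 2015, so treat it as a sketch to benchmark rather than a guaranteed fix: it avoids the temporary arrays but still makes three passes over memory instead of one.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: computes d = a + b * c without temporary valarrays.
void fma_in_place(std::valarray<float>& d,
                  const std::valarray<float>& a,
                  const std::valarray<float>& b,
                  const std::valarray<float>& c)
{
   assert(a.size() == b.size() && b.size() == c.size());

   d = b;   // copy b into d (d is resized if the sizes differ, since C++11)
   d *= c;  // element-wise multiply, in place
   d += a;  // element-wise add, in place
}
```

In the benchmark from the question, the call fma_in_place(d, a, b, c); would take the place of d = a + b * c;.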

Peter Mortensen
Dmytro Dadyka
  • I read an article about how `valarray` never quite had the performance that it was intended to have, in any compiler, so as a result, MSVC never bothered to optimize it, because it was always slow regardless. – Mooing Duck May 09 '19 at 00:23
  • I looked at the GNU `valarray` implementation. In this implementation, a template proxy object is returned and the real calculations only occur on assignment. Performance is only slightly below that of explicit loops. It looks like it is still possible to get an efficient `valarray`. – Dmytro Dadyka May 09 '19 at 00:33
  • https://developercommunity.visualstudio.com/content/problem/308961/stdvalarray.html for MS reply to a bug report. – Marc Glisse May 09 '19 at 05:59
  • @DmytroDadyka: You misunderstand. Microsoft's claim was that even with the optimizations in GNU, the `valarray` was only very slightly faster than the naive version, and still significantly slower than assembly using the desired commands. – Mooing Duck May 09 '19 at 20:47
  • https://www.quora.com/Why-does-nobody-seem-to-use-std-valarray/answer/Daniel-N%C3%A4slund Reading this, I vaguely remember that the problem was that users would make copies too often by accident, which completely negated the performance gains. Now that we have move constructors, that may or may not be better. – Mooing Duck May 09 '19 at 20:57