
I'm getting puzzling results while doing fairly simple tasks to compare the performance of:

  • Eigen::Matrix
  • boost::multi_array
  • boost::multi_array mapped to Eigen::Matrix using Eigen::Map

This is an abridged version of my test code; a fuller version can be found at: http://pastebin.com/faZ7TvJG.

boost::multi_array<double, 2, Eigen::aligned_allocator<double> > boost_multi_array(boost::extents[rows][cols]);
Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> eigen_matrix;
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> > boost_multi_array_mapped(boost_multi_array.data(), rows, cols);

double
    tmp_sum = 0,
    tmp_time = omp_get_wtime();

for(size_t i=0; i<iterations; i++)
{
    for(size_t j=0; j<rows; j++)
    {
        for(size_t k=0; k<cols; k++)
        {
            //if(k%2==0)
            //{ // commented out are the different options
                //tmp_sum += boost_multi_array[j][k];
                //tmp_sum += boost_multi_array_mapped(j,k);
                tmp_sum += eigen_matrix(j,k);
            //}
        }
    }
}

const double sequential_access_time = omp_get_wtime() - tmp_time;

The results are as follows:

Sequential Access:
   BOOST (MAPPED)   : 1.45763s
   EIGEN            : 1.45736s
   BOOST            : 2.58971s

If I use an if-statement to skip every second element, I then get the following results:

Alternating Access:
   BOOST (MAPPED)   : 1.67301s
   EIGEN            : 2.08834s
   BOOST            : 2.35295s

Inspecting the assembly shows that in the sequential-access case Eigen is faster because the summation loop gets vectorized, whereas it does not when using raw boost::multi_array.

My questions then are:

  1. Why is boost::multi_array not vectorized, while Eigen::Matrix is?
  2. Why would a multi_array mapped to Eigen be faster than a "native" Eigen data structure?

For compilation I use the following:

g++ -I /usr/include/eigen3 test.cpp -Wall -O3 -DNDEBUG -DBOOST_DISABLE_ASSERTS -fopenmp -ffast-math -funroll-loops -march=native -mtune=native -o array_test
  • Which gcc version do you use ? – rodrigob Jun 07 '13 at 19:14
  • I don't have a specific answer, but I did a similar analysis and found Eigen is also much faster in all matrix operations as opposed to boost. Eigen is just specialized for matrix operations. – andre Jun 07 '13 at 19:15
  • @rodrigob: I use gcc 4.7.2. – mohomran Jun 07 '13 at 19:23
  • Eigen includes vectorized code for many operations, while I'm not sure if boost does. You can compile with `-DEIGEN_DONT_VECTORIZE` to disable Eigen's vectorized code. – David Brown Jun 07 '13 at 19:30
  • 1
    What are you trying to measure with this kind of benchmark? You are only using accessors. There is nothing a library can do about them, except not adding stupid overhead. You should at list try the library reduction algorithms, e.g., by calling .row(j).head(cols_subset).sum() on each row. Also what's the point of benchmarking on a 5GB data set? – ggael Jun 07 '13 at 21:04
  • I am actually measuring performance on more complex tasks. This was just a warm-up benchmark, yet the results are odd. Why for example is less overhead added when mapping a boost::multi_array to Eigen, than when directly using an Eigen::Matrix? – mohomran Jun 07 '13 at 21:57
  • Benchmarking can have other purposes besides obtaining raw measurements, e.g. understanding what's going on under the hood. Also, the application I'm working on is highly memory intensive. – mohomran Jun 07 '13 at 21:59
  • The key to the compiler being able to vectorize is the `restrict` keyword. Maybe they took care of using it in Eigen, which is why a boost multi_array is fast when it is using Eigen as storage. – migle Jun 07 '13 at 22:23
  • @ggael we benchmark on 5 GB because our production code handles data in the range of 3 to 90 GB. The code we are trying to optimize obviously does more than a simple sum; the shared piece of simple code already exhibits the "strange behaviors" we observe in the case of interest. – rodrigob Jun 09 '13 at 11:21
  • @DavidBrown boost certainly does not include vectorization primitives. Note however that we are not calling "an_eigen_matrix.sum()", which would clearly be highly optimized, but simply "tmp_sum += an_eigen_matrix(j, k)"; we wonder what Eigen is doing there that gives the compiler "extra hints" missing from the "tmp_sum += boost_multi_array[j][k]" case. – rodrigob Jun 09 '13 at 11:24
  • @rodrigob I expect it's because `an_eigen_matrix(j,k)` is only one function call whereas `boost_multi_array[j][k]` is two function calls. – David Brown Jun 09 '13 at 17:06
  • @DavidBrown we did try splitting the access to boost_multi_array using boost::multi_array<...>::const_reference to do "direct row-wise access" (boost_multi_array_row_j[k]). Doing so made no speed difference, which makes us think that "one or two function calls" is not the issue (plus operator[] is expected to be inlined, so no real function call is made). – rodrigob Jun 10 '13 at 13:32
  • Boost.MA can reach the same access speed as a native (2D) array, even with linear access: https://stackoverflow.com/a/72596238/225186. Perhaps compilers have improved since this question was asked, or there was some flaw in the benchmarking method. – alfC Jun 14 '22 at 09:23
