Boost MultiArrays performance is poor

Question

I noticed that my boost mutiarrays were performing very badly compared to STL Vector. I came upon this question asked earlier, where the most liked answer stated that

1) Boost is nearly as fast as native array

2) You need to change the order in which you access your data elements to get the best performance out of Boost MultiArray. Also, that you need to run in Release mode, and not Debug.

Well, I did all that, and yet the performance of my MultiArrays is pretty shabby.

I am posting my code here :

A) WITH DEFAULT ORDERING

#include <windows.h>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>
#include <stdio.h>
#include <conio.h>
#include <iostream>

int main(int argc, char* argv[])
{
    const int X_SIZE = 400;
    const int Y_SIZE = 400;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] *= 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] *= 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    return 0;
}

B) WITH INVERTED ORDERING

#include <windows.h>
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS 
#include <boost/multi_array.hpp>
#include <stdio.h>
#include <conio.h>
#include <iostream>

int main(int argc, char* argv[])
{
    const int X_SIZE = 400;
    const int Y_SIZE = 400;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int x = 0; x < X_SIZE; ++x)
        {
            for (int y = 0; y < Y_SIZE; ++y)
            {
                boostMatrix[x][y] *= 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int x = 0; x < X_SIZE; ++x)
        {
            for (int y = 0; y < Y_SIZE; ++y)
            {
                nativeMatrix[x + (y * X_SIZE)] *= 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    return 0;
}

In every possible permutation, my benchmarks are approximately the same :

1) For Native code : 0.15s

2) For Boost MultiArray : 1.0s
I am using Visual Studio 2010.

My question is : given that I am using Visual Studio, how to get good performance from Boost MultiArrays?

UPDATE :

I switched over to Visual Studio 2013. There, I enabled the Qvec-report2 compiler switch. And very interestingly, when I compiled, I started receiving an info message saying that the compiler was failing to vectorize. Here is a sample info message which looks almost like a warning. I received several such messages for the simplest of code.

--- Analyzing function: void __cdecl `vector constructor iterator'(void * __ptr64,unsigned __int64,int,void * __ptr64 (__cdecl*)(void * __ptr64)) 1> D:\Workspace\test\Scrap\Scrap\Source.cpp : info C5002: loop not vectorized due to reason '1301'

I think this is a major clue as to why Boost multiarrays are performing slower on my Visual Studio while they perform alright on GCC. Given this extra information, can you please think of a way to resolve the problem?

@Admins : Kindly unmark my question as previously answered. I have made a major edit.

You might want to compare the generated assembler code. Also you might want to consider putting this on codereview.se. Furthermore I don't see what the exact difference between yours and the linked question is, as the questioner is getting to the same results as you (as well as the "answers" stating they get the same difference for msvc compilers), and the answers that get different values use gcc. — PlasmaHH, Jun 20 '14 at 20:34
Rules of SO say that new questions should be new, and just because another didn't get an answer isn't a good enough reason to repost it, even if it is old. I am not drawing any conclusion of if it is possible or not; I suggest you do a real analysis, starting with the generated assembler code, looking for major differences. There evidently is some difference, and if you spot it, you might even be able to reformulate your question in terms of "why is this generated assembler faster" (if it isn't obvious then already) — PlasmaHH, Jun 20 '14 at 20:42
Where did the questioner say he was using MSVC? May be the questioner was using GCC himself, and was satisfied by the answer since it worked for GCC. — The Vivandiere, Jun 20 '14 at 20:46
There are multiple hints, like using `` talking about debug/release builds, using `_SCL_SECURE_NO_WARNINGS` and not objecting to the observation "It looks like you are using msvc++" as well as using `GetTickCount` to measure performance (winapi) — PlasmaHH, Jun 20 '14 at 20:49
@user3670482 It looks like in the [original question](http://stackoverflow.com/questions/446866) the questioner was also using MSVC and was able to get a 16 to 1 performance difference vs. raw arrays when using that library with that compiler. That's the situation you're in - is that correct? — Drew Dormann, Jun 20 '14 at 21:01
I found a workaround. For now, I am either going to use Boost ublas matrix or Eigen MatrixXd. Their performance is near native implementation. — The Vivandiere, Jun 29 '14 at 14:55
Did you take this into account? http://stackoverflow.com/a/567530/225186 "MultiArray can index arrays from bases different from 0. That means that multi_array stores a base number (in each dimension) and uses a more complicated formula to obtain the exact location in memory". The extra integer multiplications and additions to locate the point in memory can explain the difference. In some sense multi_arrays are more general because they can have a) arbitrary base index b) arbitrary order. — alfC, Nov 12 '15 at 23:10

alfC · Answer 1 · 2023-08-31T09:34:12.627

Let's see if this issue can be settled now.

Here I translated your benchmarks to the Google Benchmark framework and I found that there no empirical reason to expect that MultiArray will be slower than native arrays if the code is compiled with a decent level of optimization. I increased the array sizes to obtain less noisy results.

You have to be very careful on how you measure times and what are you really measuring:

The main issue is that to make a one-to-one comparison you need to see how data is initializing before access. The main difference between new[] and boost::multi_array is that the later initializes elements to zero after allocation. When that happens the memory is "hot" for the subsequent loop (in apparent favor of Boost.MA)

To make the tests more symmetric I initialized with new[]{} (also with Google Benchmark you can use other techniques).

These are the results, (note that "Inverted" means correct, favorable order of access):

2022-06-12T14:52:44-07:00
Running ./element_access.cppx
Run on (12 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 4.25, 4.07, 4.38
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
MeasureNative                     8535616 ns      8535519 ns           83
MeasureBoostMultiArray           91876643 ns     91874361 ns            8
MeasureBoostMultiArrayInverted    8597615 ns      8597571 ns           82
MeasureBoostMultiArrayRaw         8555218 ns      8554977 ns           83

Here it is the full code, which was compiled with c++ -std=c++17 -O3 -DNDEBUG -DBOOST_DISABLE_ASSERTS element_access.cpp -o element_access.cppx -lbenchmark.

#include <boost/multi_array.hpp>

#include <benchmark/benchmark.h>  // Google micro benchmarking library

const int X_SIZE = 4000;
const int Y_SIZE = 4000;

typedef boost::multi_array<double, 2> ImageArrayType;

static void MeasureNative(benchmark::State& state) {

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE]{};

    for (auto _ : state) {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] *= 2.345;
            }
        }
        benchmark::DoNotOptimize(nativeMatrix);
    }
    delete[] nativeMatrix;
}
BENCHMARK(MeasureNative);

static void MeasureBoostMultiArray(benchmark::State& state) {

    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    for (auto _ : state) {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] *= 2.345;
            }
        }
        benchmark::DoNotOptimize(boostMatrix);
    }
}
BENCHMARK(MeasureBoostMultiArray);


static void MeasureBoostMultiArrayInverted(benchmark::State& state) {

    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    for (auto _ : state) {
        for (int x = 0; x < X_SIZE; ++x)
        {
            for (int y = 0; y < Y_SIZE; ++y)
            {
                boostMatrix[x][y] *= 2.345;
            }
        }
        benchmark::DoNotOptimize(boostMatrix);
    }
}
BENCHMARK(MeasureBoostMultiArrayInverted);

static void MeasureBoostMultiArrayRaw(benchmark::State& state) {

    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    for (auto _ : state) {
        for (int n = 0; n < boostMatrix.num_elements(); ++n)
        {
            boostMatrix.data()[n] *= 2.345;
        }
        benchmark::DoNotOptimize(boostMatrix);
    }
}
BENCHMARK(MeasureBoostMultiArrayRaw);

BENCHMARK_MAIN();

Finally, here you can see Native, Boost.MultiArray and another "Multi" library transform the loop into the same machine code with GCC https://godbolt.org/z/aeGoPvr6h .

Boost MultiArrays performance is poor

1 Answers1

Linked