
Inside a performance-critical, parallel code I have a vector whose elements are:

  • Very expensive to compute, with a deterministic result (the value of an element depends only on its position)
  • Accessed randomly (typically the number of accesses is larger, often much larger, than the size of the vector)
  • Accessed in clusters (many accesses request the same value)
  • Shared by different threads (race condition?)
  • To avoid heap fragmentation, the object should never be recreated, but whenever possible reset and recycled
  • The value to be placed in the vector is provided by a polymorphic object

Currently, I precompute all possible values of the vector, so a race condition should not be an issue. To improve performance, I am considering creating a lazy vector, so that the code performs a computation only when an element is actually requested. In a parallel region, more than one thread might request, and perhaps compute, the same element at the same time. How do I take care of this possible race condition?
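The simplest safe approach I can think of is to guard the check-and-compute step with a single mutex. A rough sketch (class and member names are illustrative, and a cheap stand-in replaces the expensive function), though I suspect it serializes far too much:

```cpp
#include <cmath>
#include <mutex>
#include <vector>

// One mutex serializes the whole check-and-compute step: trivially
// free of data races, but all accesses contend on the same lock.
class CoarseLazyVector {
public:
    explicit CoarseLazyVector(int n) : values_(n, 0.0), computed_(n, 0) {}

    double at(int pos) {
        std::lock_guard<std::mutex> lock(m_);    // serializes every access
        if (!computed_[pos]) {
            values_[pos] = std::sqrt(pos + 1.0); // stand-in for the expensive call
            computed_[pos] = 1;
        }
        return values_[pos];
    }

private:
    std::mutex m_;
    std::vector<double> values_;
    std::vector<char> computed_;  // char, not vector<bool>: each slot is a distinct object
};
```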

Below is an example of what I want to achieve. It compiles and runs properly under Windows 10 with Visual Studio 2017. I use C++17.

// Lazy.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <chrono>
#include <cmath>     // log, exp
const double START_SUM = 1;
const double END_SUM = 1000;

//base object responsible for providing the values
class Evaluator
{
public:
    Evaluator() = default;
    virtual ~Evaluator() = default; // virtual destructor for a polymorphic base
    //Function with deterministic output, depending on the position
    virtual double expensiveFunction(int pos) const = 0;
};
//
class EvaluatorA: public Evaluator
{
public:
    //expensive evaluation
    virtual double expensiveFunction(int pos) const override {
        double t = 0;
        for (int j = START_SUM; j < END_SUM; j++)
            t += log(exp(log(exp(log(j + pos)))));
        return t;
    }
    EvaluatorA() {};
    ~EvaluatorA() {};
};
class EvaluatorB : public Evaluator
{
public:
    //even more expensive evaluation
    virtual double expensiveFunction(int pos) const override {
        double t = 0;
        for (int j = START_SUM; j < 10 * END_SUM; j++)
            t += log(exp(log(exp(log(j + pos)))));
        return t;
    }
    EvaluatorB() {};
    ~EvaluatorB() {};
};

class LazyVectorTest //vector that contains N possible results
{
public:
    LazyVectorTest(int N,const Evaluator & eval) : N(N), innerContainer(N, 0), isThatComputed(N, false), eval_ptr(&eval)
    {};
    ~LazyVectorTest() {};

    //reset, to generate a new table of values
    //the size of the vector stays constant
    void reset(const Evaluator & eval) {
        this->eval_ptr = &eval;
        for (int i = 0; i<N; i++)
            isThatComputed[i] = false;
    }
    int size() { return N; }
    //accessing the same position should yield the same result
    //unless the object is reset
    inline const double& operator[](int pos) {
        if (!isThatComputed[pos]) {
            innerContainer[pos] = eval_ptr->expensiveFunction(pos);
            isThatComputed[pos] = true;
        }
        return innerContainer[pos];
    }

private:
    const int N;
    const Evaluator* eval_ptr;
    std::vector<double> innerContainer;
    std::vector<bool> isThatComputed;
};
//the parallel access will take place here

template <typename T>
double accessingFunction(T& A, const std::vector<int>& elementsToAccess) {
    double tsum = 0;
    int size = elementsToAccess.size();
//#pragma omp parallel for
    for (int i = 0; i < size; i++)
        tsum += A[elementsToAccess[i]];
    return tsum;
}

std::vector<int> randomPos(int sizePos, int N) {
    std::vector<int> elementsToAccess;
    for (int i = 0; i < sizePos; i++)
        elementsToAccess.push_back(rand() % N);
    return elementsToAccess;
}
int main()
{
    srand(static_cast<unsigned int>(time(nullptr)));
    int minAccessNumber = 1;
    int maxAccessNumber = 100;
    int sizeVector = 50;

    auto start = std::chrono::steady_clock::now();
    double res = 0;
    float numberTest = 100;
    typedef LazyVectorTest container;

    EvaluatorA eval;
    for (int i = 0; i < static_cast<int>(numberTest); i++) {
        res = eval.expensiveFunction(i);
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> diff(end - start);
    double benchmark = diff.count() / numberTest;
    std::cout << "Average time to compute expensive function: " << benchmark << " ms" << std::endl;
    std::cout << "Value of the function:" << res<< std::endl;
    std::vector<std::vector<int>> indexs(numberTest);
    container A(sizeVector, eval);

    for (int accessNumber = minAccessNumber; accessNumber < maxAccessNumber; accessNumber++) {
        indexs.clear();
        for (int i = 0; i < static_cast<int>(numberTest); i++) {
            indexs.emplace_back(randomPos(accessNumber, sizeVector));
        }
        auto start_lazy = std::chrono::steady_clock::now();
        for (int i = 0; i < static_cast<int>(numberTest); i++) {
            A.reset(eval);
            double res_lazy = accessingFunction(A, indexs[i]);
        }
        auto end_lazy = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> diff_lazy(end_lazy - start_lazy);
        std::cout << accessNumber << "," << diff_lazy.count() / numberTest << ", " << diff_lazy.count() / (numberTest* benchmark) << std::endl;
    }
    return 0;
}
  • Your description screams to me ["Dynamic Programming"](https://stackoverflow.com/questions/1065433/what-is-dynamic-programming) with a [`std::map`](http://www.cplusplus.com/reference/map/map/)-based implementation... Kinda long to turn into an answer right now (I'll try later if someone doesn't beat me to it), but I'd suggest you take a look at those links as a starting point. Just make sure to use locking as needed when generating the values for the first time and you'll be safe from race conditions. – GPhilo Jul 16 '18 at 08:10
  • I don't really see that. There is no recursion in the function being evaluated, so I don't see how DP would help (each valuation is independent from the others). And a map is slower to access than a vector. – Enzo Ferrazzano Jul 16 '18 at 08:38
  • DP does not need explicit recursion. You mention that values are used multiple times and always deterministic, that's all you need to justify DP, since you base further computations on the previously computed values. About the `map`, I agree. The question is whether you always use all positions in the array (in which case, `vector` is the way to go) or there can be a (reasonably high) number of unused values that would make a vector approach excessively wasteful memory-wise – GPhilo Jul 16 '18 at 08:43
  • Is the vector size known at compile time? Or is it determined at runtime? And, if at runtime, is it known at the beginning of the algorithm or can it grow throughout execution? – GPhilo Jul 16 '18 at 08:47
  • the size is known at runtime, but once established, it stays constant. – Enzo Ferrazzano Jul 16 '18 at 08:52
  • The choice of `vector` or `map` rather depends on the distribution of values. If you need 10 unique values uniformly in a range of 0-1,000,000,000 then `map` will probably be faster – Caleth Jul 16 '18 at 08:56
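A minimal sketch of the map-based memoization discussed in the comments, with the coarse locking GPhilo mentions (illustrative names, and a cheap stand-in computation); it only pays memory for the indices actually touched:

```cpp
#include <cmath>
#include <map>
#include <mutex>

// Sparse lazy cache: computes a value on first access, then serves it
// from the map. A single mutex guards the find-or-insert step.
class SparseLazyCache {
public:
    double at(long pos) {
        std::lock_guard<std::mutex> lock(m_);
        auto it = cache_.find(pos);
        if (it == cache_.end())
            it = cache_.emplace(pos, std::log(pos + 2.0)).first; // stand-in computation
        return it->second;
    }

private:
    std::mutex m_;
    std::map<long, double> cache_; // memory grows with distinct indices only
};
```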

2 Answers


Rather than roll your own locking, I'd first see if you get acceptable performance with std::call_once.

#include <cmath>   // log, exp
#include <mutex>   // std::call_once, std::once_flag
#include <vector>

class LazyVectorTest //vector that contains N possible results
{
    //Function with deterministic output, depending on the position
    void expensiveFunction(int pos) {
        double t = 0;
        for (int j = START_SUM; j < END_SUM; j++)
            t += log(exp(log(exp(log(j+pos)))));
        values[pos] = t;
    }

public:
    LazyVectorTest(int N) : values(N), flags(N)
    {};

    int size() { return values.size(); }
    //accessing the same position should yield the same result
    double operator[](int pos) {
        std::call_once(flags[pos], &LazyVectorTest::expensiveFunction, this, pos);
        return values[pos];
    }
private:
    std::vector<double> values;
    std::vector<std::once_flag> flags;
};

`call_once` is pretty transparent. It allows exactly one thread to run a function to completion. The only potential drawback is that it will block a second thread while waiting for a possible exception, rather than immediately doing nothing. In this case that is desirable, as you want the modification `values[pos] = t;` to be sequenced before the read `return values[pos];`
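A self-contained sketch of the same `call_once` pattern under concurrent access (my names, with a cheap stand-in for the expensive function); every thread walks the whole vector, so many threads race on the same `once_flag` slots, and each slot is still computed exactly once:

```cpp
#include <cmath>
#include <mutex>
#include <thread>
#include <vector>

class LazyVector {
    void compute(int pos) { values[pos] = std::log(pos + 2.0); } // stand-in
public:
    explicit LazyVector(int n) : values(n), flags(n) {}
    double operator[](int pos) {
        // Exactly one thread runs compute() for a given pos; the rest block
        // until it finishes, then read the stored value.
        std::call_once(flags[pos], &LazyVector::compute, this, pos);
        return values[pos];
    }
private:
    std::vector<double> values;
    std::vector<std::once_flag> flags; // sized once; once_flag is neither copyable nor movable
};

double sumFromThreads(LazyVector& v, int n, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (int i = 0; i < n; ++i)   // all threads touch all slots
                partial[t] += v[i];
        });
    for (auto& th : pool) th.join();
    double s = 0;
    for (double p : partial) s += p;
    return s;
}
```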

Caleth
  • Didn't even know `call_once` existed, nice solution! – GPhilo Jul 16 '18 at 09:06
  • In the serial version of your code this is even a little bit faster than mine! :) Unfortunately, parallelizing this example seems not to bring much in terms of speed. I will try it on my code and see how it looks. I would have two questions, as I am totally new to this aspect of C++: what are the possible drawbacks of `call_once`? And is there a way to reset the flags rather than reassign them? – Enzo Ferrazzano Jul 16 '18 at 10:21
  • Thanks! The flag resetting does not really work, as the copy constructor is deleted for the `once_flag` struct. The idea seems to be that a `once_flag` should not be reused. Therefore the lazy object must be destroyed and reconstructed every time I need a new one. In my code it is advisable to avoid this, as the object might be created millions of times, and I need to avoid reallocating memory as much as possible. I will add this to my request. – Enzo Ferrazzano Jul 16 '18 at 13:49
  • This is quite elegant and will probably work in practice. The question is: is it guaranteed that all threads see the updated `values[pos]` after they passed the `call_once`? I don't think so - partially because the interaction between the OpenMP threading/memory model and C++11 threading is unspecified. – Zulan Jul 17 '18 at 11:18
  • This will not compile, unless the `vector` is completely pre-allocated by construction rsp. assignment, because `once_flag` is not MoveInsertable. A `std::deque` may be a better choice here because it allows using at least `emplace_back()`/`emplace_front()`. – Arne Vogel Jul 25 '18 at 11:44
  • @ArneVogel it *is*. You'd have to make a bunch of changes to make this class copyable / resizable etc, but that isn't what OP asked for – Caleth Jul 25 '18 at 11:49

Your current code is problematic, mainly because std::vector<bool> is horrible, but also because atomicity and memory consistency are missing. Here is a sketch of a solution based entirely on OpenMP. I would suggest actually using a special marker for missing entries instead of a separate vector<bool>: it makes everything much easier:

#include <cmath>
#include <limits>
#include <vector>

class LazyVectorTest //vector that contains N possible results
{
public:
    LazyVectorTest(int N,const Evaluator & eval) : N(N), innerContainer(N, invalid), eval_ptr(&eval)
    {};
    ~LazyVectorTest() {};

    //reset, to generate a new table of values
    //the size of the vector stays constant
    void reset(const Evaluator & eval) {
        this->eval_ptr = &eval;
        for (int i = 0; i<N; i++) {
            // Use atomic if reset() could possibly run concurrently with reads;
            // omit it for performance if you never call reset() in parallel
            #pragma omp atomic write
            innerContainer[i] = invalid;
        }
        // Flush to make sure invalidation is visible to all threads
        #pragma omp flush
    }
    int size() { return N; }
    // Don't return a reference here
    double operator[] (int pos) {
        double value;
        #pragma omp atomic read
        value = innerContainer[pos];
        if (value == invalid) {
            value = eval_ptr->expensiveFunction(pos);
            #pragma omp atomic write
            innerContainer[pos] = value;
        }
        return value;
    }

private:
    // Use a sentinel that compares reliably with ==. NaN does not, since
    // NaN != NaN (and std::nan is not constexpr), so use infinity instead.
    static constexpr double invalid = std::numeric_limits<double>::infinity();
    const int N;
    const Evaluator* eval_ptr;
    std::vector<double> innerContainer;
};

In case of a collision, the other threads will just redundantly compute the value, exploiting its deterministic nature. By using omp atomic on both the read and the write of the elements, you ensure that no inconsistent, half-written values are ever read.

This solution may create some additional latency in the rare bad cases. In turn, the good cases are optimal, with just a single atomic read. You don't even need any memory flushes / seq_cst: the worst case is a redundant computation. You would need sequential consistency if you wrote the flag and the value separately, to ensure that the order in which the changes become visible is correct.
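For reference, a portable C++11 rendering of the same idea using std::atomic<double> instead of OpenMP atomics (my translation, not part of the answer above): one atomic slot per element, an infinity sentinel for "not yet computed", and relaxed ordering, since a racing store writes the same deterministic value and the worst case is a redundant recomputation.

```cpp
#include <atomic>
#include <cmath>
#include <limits>
#include <vector>

class AtomicLazyVector {
    // Infinity compares reliably with ==, unlike NaN.
    static constexpr double invalid = std::numeric_limits<double>::infinity();
public:
    explicit AtomicLazyVector(int n) : slots(n) {
        for (auto& s : slots) s.store(invalid, std::memory_order_relaxed);
    }
    double operator[](int pos) {
        double value = slots[pos].load(std::memory_order_relaxed);
        if (value == invalid) {
            value = std::log(pos + 2.0);  // stand-in for expensiveFunction
            // Racing threads store the same value, so relaxed ordering suffices.
            slots[pos].store(value, std::memory_order_relaxed);
        }
        return value;
    }
private:
    std::vector<std::atomic<double>> slots;
};
```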

Zulan
  • The behavior of the array of bools is very interesting. What if the atomic approach does not apply, e.g. instead of a double I have a custom class? Should I use a mutex? – Enzo Ferrazzano Jul 17 '18 at 19:46
  • In the case of a custom class you would start with `{ std::lock_guard lock(mutex[pos]); if (!innerContainer[pos]) { innerContainer[pos] = std::make_unique<...>(...); } return innerContainer[pos]; }`. If that ever becomes a performance bottleneck you could also do a DCLP (double-checked locking pattern), which is notoriously hard to get right. You can use either C++11 mutexes or OpenMP mutexes... The latter are more well-defined in the context of OpenMP but not as nice to use. – Zulan Jul 18 '18 at 08:51
  • Actually, in the real code we use a customized thread pool, as tests show that for this particular function it outperforms OpenMP (a fact for which I don't really have an explanation yet). So a mutex would be the way to go. This was very informative. Thanks! – Enzo Ferrazzano Jul 19 '18 at 06:04
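A minimal sketch of the mutex-per-slot pattern from Zulan's comment, for an element type that cannot be read or written atomically (names and the element type are illustrative): each position pairs a mutex with a lazily constructed object, so threads only contend when they hit the same slot.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <vector>

class LazyObjectVector {
public:
    explicit LazyObjectVector(int n) : locks(n), items(n) {}

    const std::string& at(int pos) {
        // Lock only this slot; accesses to different positions don't contend.
        std::lock_guard<std::mutex> lock(locks[pos]);
        if (!items[pos])
            items[pos] = std::make_unique<std::string>("item-" + std::to_string(pos));
        return *items[pos];
    }

private:
    std::vector<std::mutex> locks; // sized once; mutexes are neither copyable nor movable
    std::vector<std::unique_ptr<std::string>> items;
};
```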