What is the reason of bad parallel performance?

Question

I'm trying to implement parallel algorithm that will compute Levenshtein distance between each of sequences in a list and store them in matrix (2d vector). In other words, I'm given 2d vector with numbers (thousands of number sequences of up to 30 numbers) and I need to compute Levenshtein distance between each vector of integers. I implemented serial algorithm that works, but when I tried to convert it to parallel, it is much slower (the more threads, the slower it is). The parallel version is implemented with c++11 threads (I also tried OpenMP, but with the same results).

Here is the function that distributes work:

vector<vector<int>> getGraphParallel(vector<vector<int>>& records){
    int V = records.size();
    auto threadCount = std::thread::hardware_concurrency();

    if(threadCount == 0){
        threadCount = 1;
    }
    vector<future<vector<vector<int>>>> futures;
    int rowCount = V / threadCount;
    vector<vector<int>>::const_iterator first = records.begin();
    vector<vector<int>>::const_iterator last = records.begin() + V;

    for(int i = 0; i < threadCount; i++){
        int start = i * rowCount;
        if(i == threadCount - 1){
            rowCount += V % threadCount;
        }
        futures.push_back(std::async(getRows, std::ref(records), start, rowCount, V));
    }

    vector<vector<int>> graph;
    for(int i = 0; i < futures.size(); i++){
        auto result = futures[i].get();
        for(const auto &row : result){
            graph.push_back(row);
        }
    }
    for(int i = 0; i < V; i++)
    {
        for(int j = i + 1; j < V; j++){
            graph[j][i] = graph[i][j];
        }
    }

    return graph;
}

Here is the function that computes rows of final matrix:

vector<vector<int>> getRows(vector<vector<int>>& records, int from, int count, int size){
    vector<vector<int>> result(count, vector<int>(size, 0));
    for(int i = 0; i < count; i++){
        for(int j = i + from + 1; j < size; j++){
            result[i][j] = levenshteinDistance(records[i + from], records[j]);
        }
    }
    return result;
}

And finally function that computes Levenshtein distance:

int levenshteinDistance(const vector<int>& first, const vector<int>& second){
    const int sizeFirst = first.size();
    const int sizeSecond = second.size();

    if(sizeFirst == 0) return sizeSecond;
    if(sizeSecond == 0) return sizeFirst;

    vector<vector<int>> distances(sizeFirst + 1, vector<int>(sizeSecond + 1, 0));

    for(int i = 0; i <= sizeFirst; i++){
        distances[i][0] = i;
    }

    for(int j = 0; j <= sizeSecond; j++){
        distances[0][j] = j;
    }

    for (int j = 1; j <= sizeSecond; j++)
        for (int i = 1; i <= sizeFirst; i++)
            if (first[i - 1] == second[j - 1])
                distances[i][j] = distances[i - 1][j - 1];
            else
                distances[i][j] = min(min(
                        distances[i - 1][j] + 1,
                        distances[i][j - 1] + 1),
                        distances[i - 1][j - 1] + 1
                );

    return distances[sizeFirst][sizeSecond];
}

One thing that came to my mind is that this slow down is caused by false sharing, but I could not check it with perf, because I'm working with Ubuntu in Oracle VirtualBox - cache misses are not avalaible there. If I'm right and the slow down is caused by false sharing, what should I do to fix it? If not, what is the reason of this slow down?

There is no such thing as a 2D vector. What you have is a vector of vectors. As in you're doing thousands and thousands of heap allocations. Check [this](https://stackoverflow.com/a/54884445/3212865) for more on this specific issue. Also, in a multi threaded context, it's likely to compounds as many allocation implementations don't split heaps by threads, so allocations require synchronisation. There might be other issues, that's just something I see there. — spectras, Dec 30 '19 at 23:20
Most likely cache invalidation would be the reason.. You might want to try it outside a VM or adjust VM configuration (like more RAM) — Phil1970, Dec 31 '19 at 01:24
Possibilities: Pass `distances` to `levenshteinDistance` by reference, and only resize it if you need a larger vector than you've already allocated. Swap the order of the `j` and `i` loops, so references to `distances` will be more cache friendly. — 1201ProgramAlarm, Dec 31 '19 at 01:27

score 0 · Answer 1 · answered Dec 31 '19 at 17:03

One problem I can see is that you are using std::async without declaring how it should run. It can either be run async or deferred. In the case of deferred it would all run in one thread, it would just be lazily evaluated. The default behavior is implementation defined. For your case it'd make sense that it would run slower with more deferred evaluations. You can try std::async(std::launch::async, ...).

Make sure your VM is also set up to use more than 1 core. Ideally when doing optimizations such as these it'd be best to try and eliminate as many variables as possible. If you can, run the program locally without a VM. Profiling tools are your best bet and will show you exactly where time is being spent.

What is the reason of bad parallel performance?

1 Answers1