
I'm trying to parallelize some code, but I noticed some strange behavior in C++. I simplified the problem to the following: I have a huge array (100M bytes). When I write random data to this array from a single thread, it is much faster than writing from multiple threads in parallel (for example, 10 cores). I assumed that, considering that RAM speed is more than 1 GB/s, there should be no problem with parallel writes to RAM. The code is like this:

#include <iostream>
#include <cstdint>  // for uint8_t
#include <type_traits>
#include <stdio.h>
#include <stdlib.h>
#include <cstring>
#include <chrono>
#include <thread>
using namespace std;

uint8_t g[16]{1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 10, 1};

uint8_t** data = new uint8_t*[1000];

void test() {
    for (int i = 1; i < 100000000; i++) {
        int row = rand() % 1000;
        int col = rand() % 10000000;
        memcpy(&data[row][col], &g[0], 16);
        memcpy(&data[row][col + 16], &g[0], 16);
    }
}
#define TH 1

int main() {
    for (int i = 0; i < 1000; i++) {
        data[i] = new uint8_t[10000000];
    }
    std::chrono::time_point<std::chrono::high_resolution_clock> m_beg = std::chrono::high_resolution_clock::now();
    std::thread* workers = new std::thread[TH];
    for (int i = 0; i < TH; i++) {
        workers[i] = std::thread(&test);
    }
    for (int i = 0; i < TH; i++) {
        workers[i].join();
    }

    double t = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - m_beg).count();
    cout << t << endl;
}

I compared two settings:

1. TH=1, test loop counter = 100M

2. TH=10, test loop counter = 10M

and the results are as follows:

1. 10 seconds

2. 72 seconds

Does anyone have any idea what the reason is?

JGC
  • There's no protection between the threads, which means that multiple threads can access the same data simultaneously. You also have no bounds checking, meaning you could copy out of bounds of the allocated memory. Furthermore, [`std::rand`](https://en.cppreference.com/w/cpp/numeric/random/rand) returns a value between 0 and [`RAND_MAX`](https://en.cppreference.com/w/cpp/numeric/random/RAND_MAX), and `RAND_MAX` is seldom as high as multiple millions. – Some programmer dude Nov 16 '18 at 09:45
  • "I assume that by considering the RAM speed which is more than 1GB/s, there should not be any problem in parallel write on RAM." have a look at https://stackoverflow.com/a/4087315/3975177 – Swordfish Nov 16 '18 at 09:46
  • Also, your data is larger than the available caches on any modern mainstream CPU, and could be spread out all over physical memory, leading to cache reloads quite frequently. – Some programmer dude Nov 16 '18 at 09:48
  • Currently, I don't care about data protection. I accept the errors that may occur. I just wonder why, in this configuration, the speed becomes worse with parallelism. – JGC Nov 16 '18 at 09:48
  • False sharing. That would destroy performance. (A small demo of this effect follows these comments.) – Matthieu Brucher Nov 16 '18 at 09:50
  • But they are executing completely independent tasks without any lock. So why should parallelism destroy the performance? – JGC Nov 16 '18 at 09:51
  • Is it potentially a race condition, and therefore undefined behavior, whose performance you are measuring? Undefined behavior should be avoided, not researched. – Öö Tiib Nov 19 '18 at 08:41
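
To make the false-sharing comment concrete, here is a minimal, self-contained sketch; the struct names, iteration count, and the 64-byte line size are assumptions for the demo, not anything from the question. Two threads increment two counters that normally sit on the same cache line, then the same experiment runs with the counters padded onto separate lines; on typical hardware the padded version is several times faster.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Two counters packed together: they normally share one 64-byte cache line,
// so writes from two cores keep stealing the line from each other.
struct SameLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same two counters, each forced onto its own cache line.
struct SeparateLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long timed_run() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (int i = 0; i < 50000000; i++)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto beg = std::chrono::steady_clock::now();
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - beg).count();
}

int main() {
    std::cout << "same cache line:      " << timed_run<SameLine>() << " ms\n";
    std::cout << "separate cache lines: " << timed_run<SeparateLines>() << " ms\n";
}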

1 Answer


All your threads are accessing the same data in a random fashion.

Each time one thread writes something to a location, every copy of that cache line held by the other cores is invalidated and must be updated. And this happens for all your threads all the time, invalidating data in all your caches constantly.

It's not about locks; it's about the fact that the cache lines on other cores that hold the same data have to be invalidated. This comes at a cost.

Imagine that you have 1000000 screws on a small wall and ten people with ten screwdrivers. They will all get in each other's way, because if two of them want to work on screws that are in the same spot, they can't work efficiently. It's a little like that here as well, but with even more levels of hierarchy.
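
As a hedged sketch of the cure this implies (the buffer size, slice bounds, and function name below are illustrative, not from the question): keep one flat contiguous array, hand each thread its own disjoint slice, and write through it sequentially, so no two cores ever write to the same cache line.

#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

constexpr size_t N  = 100000000;  // same total volume as in the question
constexpr int    TH = 10;

uint8_t g[16]{1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 10, 1};

// Each thread owns [begin, end): contiguous, cache-line-disjoint work.
void fill_slice(uint8_t* buf, size_t begin, size_t end) {
    for (size_t i = begin; i + 32 <= end; i += 32) {
        std::memcpy(buf + i, g, 16);
        std::memcpy(buf + i + 16, g, 16);
    }
}

int main() {
    std::vector<uint8_t> data(N);
    std::vector<std::thread> workers;
    size_t chunk = N / TH;
    for (int t = 0; t < TH; t++) {
        size_t begin = t * chunk;
        size_t end   = (t == TH - 1) ? N : begin + chunk;
        workers.emplace_back(fill_slice, data.data(), begin, end);
    }
    for (auto& w : workers) w.join();
}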

Matthieu Brucher
  • What is your suggestion for removing this constraint? For instance, defining separate variables, or ... – JGC Nov 16 '18 at 11:32
  • Threads should access contiguous cache lines and be responsible for them. So align your data, use a big 1D array instead of a 2D one, then split responsibility so that one thread tackles one part of the array, for instance. It all depends on what you want to achieve in the end. – Matthieu Brucher Nov 16 '18 at 11:34
  • I repeated the experiment and replaced the 2D array with several 1D arrays (one per thread). But I'm still getting the same result! – JGC Nov 16 '18 at 11:48
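
One possible reason the separate 1D arrays did not help: every thread in the question still calls the single shared `rand()`. On some C libraries its hidden state is guarded by an internal lock, which serializes the threads, and its `RAND_MAX` may be far smaller than 10000000, in which case `rand() % 10000000` only ever touches the start of each buffer. A sketch under those assumptions, giving each thread a private `std::mt19937` and a private buffer (the `worker` signature, seeds, and sizes are illustrative):

#include <cstdint>
#include <cstring>
#include <random>
#include <thread>
#include <vector>

uint8_t g[16]{1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 10, 1};

// Each thread owns its buffer *and* its generator, so there is no shared
// hidden state (like rand()'s) to contend on.
void worker(uint8_t* buf, size_t size, unsigned seed) {
    std::mt19937 gen(seed);
    // Leave room for the two 16-byte copies so every write stays in bounds.
    std::uniform_int_distribution<size_t> dist(0, size - 32);
    for (int i = 0; i < 10000000; i++) {
        size_t col = dist(gen);
        std::memcpy(buf + col, g, 16);
        std::memcpy(buf + col + 16, g, 16);
    }
}

int main() {
    constexpr int TH = 10;
    std::vector<std::vector<uint8_t>> bufs(TH, std::vector<uint8_t>(10000000));
    std::vector<std::thread> workers;
    for (int t = 0; t < TH; t++)
        workers.emplace_back(worker, bufs[t].data(), bufs[t].size(), 1234u + t);
    for (auto& w : workers) w.join();
}

Even with private generators and buffers, randomly scattered writes across 10 MB per thread remain cache- and TLB-hostile, as the earlier comments point out, so this removes contention but does not make the access pattern itself fast.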