
I am running a simulation where I have to compute the energy of my system. The system primarily has two types of particles, and in my energy computation, I loop through all instances of each type of particle. The serial loop version looks something like this:

double CalculateEnergy (std::vector <Polymer>* Polymers, std::vector <Particle*>* Cosolvent, std::vector <Particle*>* LATTICE, std::array<double,8>* E, std::array<double,8>* contacts, int x, int y, int z) {
    
    double Energy {0.0};
    (*contacts) = {0,0,0,0,0,0,0,0}; 

    // loop through all the particles in Polymers 
    for (Polymer& pmer: (*Polymers)) {
        for (Particle*& p: pmer.chain){
            // perform an energy computation 
            Energy += (*LATTICE)[f(*p)]; // f is an arbitrary function that uses a property of p. 
        }
    }
    // loop through all particles of the cosolvent 
    for ( Particle*& p: *Cosolvent ){

        // perform an energy computation 
        Energy += (*LATTICE)[g(*p)]; // g is an arbitrary function that uses properties of p. 

    }
    
    return Energy; 
}

Yes, I am using raw pointers in C++, but it does the job well enough and I see no reason to change it so far.

The idea is that I loop through Polymers and calculate the energetic contribution of each particle in that set, then loop through Cosolvent and calculate the energetic contribution of each particle in that set. As the number of particles in my system grows, running this serially becomes too slow. So I want to parallelize the energy computation: one thread runs the Polymers loop, and another thread runs the Cosolvent loop.

This is what I have done:

double CalculateEnergy_parallelized (std::vector <Polymer>* Polymers, std::vector <Particle*>* Cosolvent, std::vector <Particle*>* LATTICE, std::array<double,8>* E, std::array<double,8>* contacts, int x, int y, int z) {
    
    double Energy {0.0};
    (*contacts) = {0,0,0,0,0,0,0,0}; 
    double E1;
    double E2; 


    std::thread t1 (PolymerEnergyLoop, Polymers, LATTICE, E, &E1, contacts, x, y, z);
    std::thread t2 (CosolventEnergyLoop, Cosolvent, LATTICE, E, &E2, contacts, x, y, z); 

    t1.join();
    t2.join();

    Energy = E1 + E2; 


    return Energy; 
}


void PolymerEnergyLoop (std::vector <Polymer>* Polymers, std::vector <Particle*>* LATTICE, std::array<double,8>* E, double* Ea, std::array<double,8>* contacts, int x, int y, int z){

    double Energy {0.0};
    (*contacts) = {0,0,0,0,0,0,0,0}; 

    // loop through all the particles in Polymers 
    for (Polymer& pmer: (*Polymers)) {
        for (Particle*& p: pmer.chain){
            // perform an energy computation 
            Energy += (*LATTICE)[f(*p)]; // f is an arbitrary function that uses a property of p. 
        }
    }

    *Ea = Energy; 
    return; 
}


void CosolventEnergyLoop (std::vector <Particle*>* Cosolvent, std::vector <Particle*>* LATTICE, std::array<double,8>* E, double* Ea, std::array<double,8>* contacts, int x, int y, int z){

    double Energy {0}; 

    for ( Particle*& p: *Cosolvent ){

        // perform an energy computation 
        Energy += (*LATTICE)[g(*p)]; // g is an arbitrary function that uses properties of p. 
    }
    
    *Ea = Energy; 
    return; 
}

However, the parallel version is about 25% slower than the serial version.
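
For reference, the two versions can be timed along these lines (a minimal sketch; the actual call arguments are elided):

#include <chrono>
#include <cstdio>

// Minimal timing harness: runs a callable and returns the elapsed wall time in ms.
template <typename F>
double time_ms (F&& call) {
    auto t0 = std::chrono::steady_clock::now();
    call();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage, roughly:
// double serial_ms   = time_ms([&]{ CalculateEnergy(/* ... */); });
// double parallel_ms = time_ms([&]{ CalculateEnergy_parallelized(/* ... */); });
// std::printf("serial: %.3f ms, parallel: %.3f ms\n", serial_ms, parallel_ms);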

I am aware of the memory-sharing issue that can arise from both threads using LATTICE. LATTICE is a superset of Polymers and Cosolvent, and I am requesting information from it repeatedly. Is this slowdown an instance of the memory-sharing problem in parallel processing? If so, what are the methods of overcoming it?
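
To make the pattern concrete, here is a stripped-down, compilable sketch of the same two-thread split. The `Particle` struct and the `f_energy`/`g_energy` functions are hypothetical stand-ins for my real types and for `f`/`g`; the structure otherwise mirrors the code above.

#include <thread>
#include <vector>

// Hypothetical stand-ins for the real particle type and for f/g.
struct Particle { double w; };
double f_energy (const Particle& p) { return p.w; }
double g_energy (const Particle& p) { return 2.0 * p.w; }

void PolymerLoop (const std::vector<Particle>* ps, double* Ea) {
    double Energy {0.0};                         // thread-local accumulator
    for (const Particle& p: *ps) { Energy += f_energy(p); }
    *Ea = Energy;                                // single write at the end
}

void CosolventLoop (const std::vector<Particle>* ps, double* Ea) {
    double Energy {0.0};
    for (const Particle& p: *ps) { Energy += g_energy(p); }
    *Ea = Energy;
}

double CalculateEnergy_sketch (const std::vector<Particle>* polymers, const std::vector<Particle>* cosolvent) {
    // E1 and E2 are adjacent on the stack, so their final writes may land
    // in the same cache line; with one write per thread this should be
    // negligible next to the loops themselves.
    double E1 {0.0};
    double E2 {0.0};
    std::thread t1 (PolymerLoop, polymers, &E1);
    std::thread t2 (CosolventLoop, cosolvent, &E2);
    t1.join();
    t2.join();
    return E1 + E2;
}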

bad_chemist
  • A side note about terminology: the term "shared memory" means something else: a "special" piece of memory that can be accessed by **different processes**. See: https://en.wikipedia.org/wiki/Shared_memory. In your case it seems like just a "regular" piece of memory being accessed by multiple threads **within a process**. This is possible to do with every chunk of memory allocated on the heap. – wohlstad Sep 16 '22 at 04:48
  • Sharing memory between threads to *read from* isn't bad, it's writing that can cause trouble, even when the memory being accessed by the threads [is not the same, but in the same cache line](https://stackoverflow.com/q/22766191/555045). – harold Sep 16 '22 at 04:52
  • _"I am using raw pointers in C++, but it does the job well enough"_ mandatory inquisition into _why_ do you need to point to a `std::vector` in the first place. Are you aware you only need to do `std::vector polymer;` instead of `auto polymer = new std::vector;`? – Passer By Sep 16 '22 at 07:35
  • Also, please provide a [mre]. There's no way to tell what's going on currently. – Passer By Sep 16 '22 at 07:38
  • There is a loop-carried dependency that prevents the loop from being (naively) parallelized. Indeed, `contacts` can be partially written by possibly any iteration, and the order matters. Besides, we do not know if `obtain_ne_list`, `take_dot_product`, `lattice_index`, `modified_direction`, etc. are thread-safe. If they are not, then the code is broken. Running the two loops in parallel is a good idea, but there is a race condition on `contacts`, potentially causing wrong results in your parallelization attempt. Race conditions usually impact performance as well. Finally, I advise you to use OpenMP for this. – Jérôme Richard Sep 16 '22 at 09:18
  • With the last version, the code can be parallelized using parallel reductions, assuming `f` and `g` are thread-safe (see the sketch after these comments). I guess the timing issue no longer applies. The main reason for this code to be slower in parallel would be the time needed to create the threads (meaning the code runs for a very short time, e.g. <1 ms). Another reason that could explain a slowdown is running this code on hardware with multiple NUMA nodes. If so, please specify the processor configuration (CPU details and number of sockets). – Jérôme Richard Sep 16 '22 at 17:54
  • I understand @JeromeRichard. I am going to run this and see how well it goes! This might be silly, but this is what I see after running `lscpu`: Architecture: x86_64; CPU op-mode(s): 32-bit, 64-bit; Byte Order: Little Endian; CPU(s): 48; On-line CPU(s) list: 0-47; Thread(s) per core: 1; Core(s) per socket: 24. – bad_chemist Sep 16 '22 at 19:29
  • Note that `lstopo` can be more convenient for getting this information. Since you run this on a 2-socket system, you may experience NUMA effects that can slow your code down by up to a factor of two. You need to control the NUMA policy with numactl and bind the threads. Binding C++ threads is quite cumbersome (either not portable or tricky) compared to OpenMP. See [this post](https://stackoverflow.com/questions/64409563/64415109#64415109) and [this one](https://stackoverflow.com/questions/62604334/62615032#62615032) for more information. – Jérôme Richard Sep 16 '22 at 19:41
  • In your case, I advise you to run your application on the first socket only, just to test performance. numactl can be used to do that (with `--physcpubind`). You may need to set the policy to a local one in this case (or also specify the memory node, to be sure). On multiple sockets, you can use something like `numactl --interleave=0,1`, but C++ threads may still migrate between NUMA nodes, so they must be manually pinned. – Jérôme Richard Sep 16 '22 at 19:45
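
A minimal sketch of the OpenMP parallel reduction suggested in the comments above, assuming `f` and `g` are thread-safe. The `energy_f`/`energy_g` functions and the flattened particle vectors are hypothetical stand-ins for the question's types:

#include <cstddef>
#include <vector>

// Hypothetical stand-ins for f and g.
double energy_f (double w) { return w; }
double energy_g (double w) { return 2.0 * w; }

// Compile with -fopenmp (GCC/Clang); without it the pragmas are ignored
// and the code runs serially.
double CalculateEnergy_omp (const std::vector<double>& polymers, const std::vector<double>& cosolvent) {
    double Energy {0.0};
    const std::ptrdiff_t np = static_cast<std::ptrdiff_t>(polymers.size());
    const std::ptrdiff_t nc = static_cast<std::ptrdiff_t>(cosolvent.size());

    // reduction(+:Energy) gives each thread a private copy of Energy and
    // combines the copies once at the end: no data race and no false
    // sharing on the accumulator.
    #pragma omp parallel for reduction(+ : Energy)
    for (std::ptrdiff_t i = 0; i < np; ++i) {
        Energy += energy_f(polymers[i]);
    }

    #pragma omp parallel for reduction(+ : Energy)
    for (std::ptrdiff_t i = 0; i < nc; ++i) {
        Energy += energy_g(cosolvent[i]);
    }

    return Energy;
}

On the 2-socket machine from the `lscpu` output, it may also help to pin a test run to a single socket first, e.g. `numactl --physcpubind=0-23 --membind=0 ./simulation` (here `./simulation` stands for your binary, and cores 0-23 are assumed to be on socket 0; `lstopo` will confirm the mapping), so that NUMA effects can be separated from thread-creation overhead.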

0 Answers