
I am developing an OpenMP code in C++ (the compiler is g++ 4.8.2). In one part of my code I need to perform an addition atomically on struct data. The struct is defined as:

struct real3 
{
  float x;
  float y;
  float z;
};    

and I defined an addition operator for it as follows:

inline real3 operator+(real3 a, real3 b)
{ 
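  // make_real3 builds a real3 from its three components (a helper defined elsewhere in this code)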
  return make_real3(a.x + b.x, a.y + b.y, a.z + b.z);
}

I use this struct throughout my code. In one part of my program, I need to perform an add operation atomically:

real3 * m_cforce;
real3 fn, ft;
int i;
/*
 . . . .  some code is here 

*/

#pragma omp atomic
m_cforce[i] = m_cforce[i] + (fn + ft);

The compiler does not accept the struct real3 as an operand for an atomic add. One solution is to use the following code instead:

#pragma omp atomic
m_cforce[i].x = m_cforce[i].x + (fn + ft).x;
#pragma omp atomic
m_cforce[i].y = m_cforce[i].y + (fn + ft).y;
#pragma omp atomic
m_cforce[i].z = m_cforce[i].z + (fn + ft).z;

This way I use three atomics instead of one, which costs more time. Is there any safe way for me to do this operation with lower computational overhead?

  • As a start, store `fn + ft` somewhere (see the sketch after these comments). You could use `critical`, but its overhead is greater than that of atomics. – dmg Jan 28 '15 at 12:23
  • @dmg I used critical and it becomes 5 times slower than using three atomic adds, since it keeps only one thread running at a time, while atomic locks just the memory location and lets the other threads keep running, except those that are going to access the same location. Thank you for your quick reply. – user3672271 Jan 28 '15 at 13:34
  • @user3672271 Naturally. `atomic`s rely on the hardware to provide fast lock-less operations, while `critical` is a general-purpose lock-full approach. – dmg Jan 28 '15 at 13:39
  • Have you considered a `lock` solution? http://stackoverflow.com/questions/2396430/how-to-use-lock-in-openmp You could try an array of locks to lock a specific element (see the lock sketch under the first answer below). – dkg Jan 28 '15 at 14:25
  • Change what you're doing. Instead of updating the shared struct each iteration, make private versions of the struct for each thread and then update the shared struct after all iterations in a critical section. – Z boson Jan 28 '15 at 17:25
  • @Zboson: Your idea is great and works well when there are only a few bins that must be updated frequently. The problem I demonstrated here is not complete: there is another update on another memory location (which is random), so there is a data dependency between threads (each thread updates its corresponding memory location and another, random memory location too). Your solution works for me if I allocate a private vector of the real3 struct for each thread and, at the end of the parallel section, do a reduction sum on them. I should test this. Thank you very much for your comment. – user3672271 Jan 28 '15 at 20:19
  • @dmg: I read texts that warned me about using locks: "the program may get stuck in deadlocks". I am not quite sure whether this applies to my case. I know how to implement it, but I am not sure that it works well on every platform, and I don't know its exact mechanism. Is the mechanism of a lock similar to that of atomics? – user3672271 Jan 28 '15 at 20:26
  • If you get stuck in a deadlock your program won't finish. Atomics just don't use locks and are faster. – dmg Jan 28 '15 at 21:48
  • BTW, have you done any performance measuring? – dmg Jan 28 '15 at 21:55
  • @dmg: I measured the performance with atomic and with critical, and the atomic add was 5 times faster (as I told you in previous comments), but I have not measured locks yet. I will inform you about the performance of locks. – user3672271 Jan 29 '15 at 07:58
  • There is no need to test locks, as `critical` most likely uses locks. My question is, "What speedup do you expect to get, and what speedup do you get compared to the serial code?" – dmg Jan 29 '15 at 08:00
  • @dmg: My PC is a dual-core one. I will test the real performance of the program and optimize it on a 48-core station after I finish code development; I have not tested the performance of the whole program yet. I expect more than 8 times faster code (theoretically it would be 14) on a 48-core station. I do almost all the calculations in parallel (more than 95% of the computational load), except for writing results and initialization. – user3672271 Jan 29 '15 at 08:35
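
Following dmg's first comment, a minimal sketch of the three-atomic version with the sum hoisted into a local temporary, so that `fn + ft` is evaluated once instead of three times (the variable name `f` is illustrative):

real3 f = fn + ft;  // compute the vector sum once, outside the atomic updates

#pragma omp atomic
m_cforce[i].x += f.x;
#pragma omp atomic
m_cforce[i].y += f.y;
#pragma omp atomic
m_cforce[i].z += f.z;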

2 Answers


OpenMP atomics need to operate on a scalar value (a simple type). Atomics are intended to be mapped by the runtime to kernel-level or even instruction-level atomics. Without knowing more about your problem it is hard to give one answer, but here are some usual suggestions for this sort of thing:

  • Use thread-local variables (if possible). If only one thread can write to a variable, you can avoid atomics for much of the computation and then just reduce at the end.
  • The 3-atomic approach you have above will work, but it does allow multiple threads to add to the same real3 in a possibly interleaved way. Since floating point isn't associative, this could lead to more non-deterministic results. Overall, this isn't a bad option.
  • Use an OpenMP lock per shared real3 (a sketch follows this list). Using critical is like using a single lock around the contained code segment. If you use a lock per real3, threads can operate in parallel as long as they are touching different real3s. OpenMP's locks aren't the fastest, but they should be faster than critical.
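
A minimal sketch of the per-real3 lock idea, assuming one omp_lock_t per body; all names besides the omp_* calls are illustrative:

#include <omp.h>

omp_lock_t * locks = new omp_lock_t[N];  // one lock per body, same length as m_cforce

// one-time setup before the parallel region
for (int k = 0; k < N; ++k)
  omp_init_lock(&locks[k]);

// inside the parallel loop: lock only the element being updated,
// so threads touching different real3s proceed in parallel
omp_set_lock(&locks[i]);
m_cforce[i] = m_cforce[i] + (fn + ft);
omp_unset_lock(&locks[i]);

// teardown after the parallel region
for (int k = 0; k < N; ++k)
  omp_destroy_lock(&locks[k]);
delete [] locks;
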
– user2548418

@user2548418:

This is a multi-body simulation where all bodies (ranging from 0 to N-1) interact with each other. In the first step there is a search to determine the pair-wise interactions. Then there is a loop over all bodies to calculate the interactions (this is a parallel section with a parallel for over counter i). Simply put, each body (i) may interact with between 0 and 20 bodies (j) in the system. When I calculate the interaction forces, I have to add these forces to the total force of each body (i and j). The code I showed above is the part in which this addition is performed, so to see this part complete, consider the following:

#pragma omp atomic
m_cforce[i].x = m_cforce[i].x + (fn + ft).x;
#pragma omp atomic
m_cforce[i].y = m_cforce[i].y + (fn + ft).y;
#pragma omp atomic
m_cforce[i].z = m_cforce[i].z + (fn + ft).z;

#pragma omp atomic
m_cforce[j].x = m_cforce[j].x - (fn + ft).x;
#pragma omp atomic
m_cforce[j].y = m_cforce[j].y - (fn + ft).y;
#pragma omp atomic
m_cforce[j].z = m_cforce[j].z - (fn + ft).z;

Each thread is going to modify two different memory locations: one at location i, which corresponds to the iteration counter, and one at location j, which is random and may be any number ranging from 0 to N-1. This dependency does not allow me to use a private scalar real3 for each thread. One solution exists: I allocate a vector real3[N] for each thread and then use a reduction sum at the end of the "parallel for" to find the total force on each body. This can be challenging, since it can reduce the cache efficiency (atomics do the same), so I should compare this solution against using atomics.
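
A minimal sketch of that reduction scheme, assuming an operator- for real3 analogous to the operator+ above and that fn, ft, and j are computed inside the elided loop body (the container name `local` is illustrative):

#include <omp.h>
#include <vector>

std::vector<std::vector<real3> > local;  // one private force array per thread

#pragma omp parallel
{
  const int tid = omp_get_thread_num();

  #pragma omp single
  local.assign(omp_get_num_threads(),
               std::vector<real3>(N));          // zero-initialised copies; implicit barrier after single

  #pragma omp for
  for (int i = 0; i < N; ++i) {
    // ... find each interacting partner j and compute fn, ft ...
    local[tid][i] = local[tid][i] + (fn + ft);  // no atomics needed: each thread
    local[tid][j] = local[tid][j] - (fn + ft);  // writes only its own copy
  }

  // reduction: fold the per-thread copies into the shared total
  #pragma omp for
  for (int i = 0; i < N; ++i)
    for (int t = 0; t < (int)local.size(); ++t)
      m_cforce[i] = m_cforce[i] + local[t][i];
}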

I know that floating point operations are not associative, but this is not restrictive in my case, since the orders of magnitude of the forces are not very different and the error is very low when the operation a + (b + c) becomes (a + b) + c.

I am not going to use locks in my code, since otherwise I would get much worse performance than with the first and second solutions.

  • @user2548418: please see the post above – user3672271 Jan 29 '15 at 21:22
  • From the iteration variable i, how does it determine which j to interact with? Would it be possible to turn the nested loop of j within i into one loop with i*j iterations? You might be able to remove all atomics if threads only add to the main iteration variable. This will involve computing the forces twice (once at each end), but it will remove the need for all atomics. – user2548418 Jan 29 '15 at 21:32
  • @user2548418 j is completely random. N is on the order of 100,000 to 1,000,000. I really cannot determine j. – user3672271 Jan 29 '15 at 21:41
  • At a minimum, you could use half the atomics if you duplicate m_cforce. When writing for i, don't use atomics and operate on one copy of m_cforce. When writing for j, use the other copy and do use atomics. At the end, add the two copies without atomics (a sketch follows below). – user2548418 Jan 30 '15 at 00:43
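
A sketch of that two-copy idea, assuming both copies start zeroed, that fn, ft, and j are computed in the elided loop body, and that the final result is merged into m_cforce (the array names m_cforce_i and m_cforce_j are illustrative):

// inside a parallel region
#pragma omp for
for (int i = 0; i < N; ++i) {
  // ... compute fn, ft and the partner index j for each interaction ...
  real3 f = fn + ft;

  // iteration i is owned by exactly one thread under the parallel for,
  // so the i-side copy needs no atomics
  m_cforce_i[i] = m_cforce_i[i] + f;

  // j is random, so the j-side copy still needs atomics
  #pragma omp atomic
  m_cforce_j[j].x -= f.x;
  #pragma omp atomic
  m_cforce_j[j].y -= f.y;
  #pragma omp atomic
  m_cforce_j[j].z -= f.z;
}

// merge the two copies once, without atomics
#pragma omp for
for (int i = 0; i < N; ++i)
  m_cforce[i] = m_cforce_i[i] + m_cforce_j[i];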