7

I have been trying to write (update) an array of doubles in parallel using OpenMP. The idea is that the element that needs to be updated can be updated more than once, and the element to be updated is computed on the fly. This makes the code very prone to race conditions unless I "lock" the memory region corresponding to the element being updated with an atomic operation, as in the example below:

#include <omp.h>
#include <stdlib.h>

size_t find_n_on_the_fly(size_t i); /* defined elsewhere; computes the target index on the fly */

int main()
{

    double *x=NULL, contribution = 1234.5;
    size_t xlen=1000, i, n;

    if ( (x = (double *) calloc(xlen, sizeof(double)) ) == NULL ) exit(-1);

    #pragma omp parallel for shared(x) private(n,contribution)
    for (i=0; i<xlen; i++) {
        n = find_n_on_the_fly(i);

        #pragma omp atomic update
        x[n]+=contribution; // in the more complicated case contributions are different

    }

    return 0;
}

I am still running into race conditions with this approach. I tried using critical sections, but that completely kills performance, since the arrays are large and the number of updates is also large.

Question: what is wrong with this approach? Is there a better way of handling this?

Note: To handle this I am doing something stupid: creating a copy of the array for each one of the threads and reducing them later. But memory limitations don't allow me to go further.

Rubem_blah
  • why not create a list of `n`/`contribution` pairs and keep pushing into it, then reduce them into the `x` array, or use any other kind of sparse representation? – user3528438 May 26 '15 at 14:21
  • @user3528438 That is an idea and is in line with my note. Do you have an actual working example that does not reduce to what I am doing already? – Rubem_blah May 26 '15 at 14:58
  • By definition, an atomic operation on a 64-bit value is only possible on a 64-bit (or greater) hardware architecture. The floating-point library may be at the root of the issue you are seeing here. Have you tried typecasting to an `unsigned long long`? Also, a hash table would be a lot faster for looking up array elements than searching through a loop for each element. – Jim Fell May 26 '15 at 18:07
  • @JimFell, you are right. Indeed, the output of my routine `find_n_on_the_fly(i)` is already the result of hashing. This is not the issue. The point is: for some values of `i` there will be threads writing to the same array position `x[n]` at the same time. Regarding the first part of your comment, I don't get what you mean by typecasting. Typecast which variable? – Rubem_blah May 26 '15 at 18:53
  • @Rubem_blah Depending on how the floating point library is implemented, writes to/from a data type of `double` may or may not be atomic. Your line `x[n]+=contribution;` could be written as `y[n] += (unsigned long long)contribution;` where `y` is defined as `unsigned long long * y = (unsigned long long *)x;`. – Jim Fell May 26 '15 at 19:30
  • Your question is a duplicate of the most frequent OpenMP question https://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic – Z boson May 27 '15 at 07:35
  • Where do you define `x`? And where do you declare `array`? Is the range of `x` the same as `xlen`? – Z boson May 27 '15 at 07:45
  • @Zboson, I am sorry. It is hard to get things right when you don't make sure your code compiles. There is just a single array of doubles, `x`. – Rubem_blah May 27 '15 at 21:56
  • @Zboson, this [link](https://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic) is not a proper answer. It is in line with the note at the bottom of my question, where you replicate the memory (or extend the array) for each one of the threads. If the array is **really** large, you can't perform this operation on a machine with, say, 16 logical cores due to memory limitations. I am looking at arrays of locks using the function `omp_set_lock` (a rough sketch of what I have in mind is shown after these comments). I will come back here when I am sure it works. – Rubem_blah May 27 '15 at 23:15
  • @user3528438, if the array after being filled is sparse then the OP should use a sparse representation (rather than an array of doubles) in the first place. – Z boson May 28 '15 at 10:34
  • I tried your example code using atomic. It works fine with 64-bit code. What's the problem? I don't see a race condition. As long as `find_n_on_the_fly` and calculating `contribution` are much slower than `x[n] += contribution`, this method should be fine. If they are not slower, then your operation is memory-bandwidth bound anyway. – Z boson May 28 '15 at 12:22
  • http://stackoverflow.com/questions/1527305/disable-hardware-software-interrupts might serve as an ugly solution. – Scalable Jun 23 '15 at 20:16
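
Below is a rough, untested sketch of the lock-array idea mentioned in the comments above. The number of lock "stripes" (`NLOCKS`) and the striping by `n % NLOCKS` are assumptions for illustration, not something from the original post:

#include <omp.h>
#include <stdlib.h>

#define NLOCKS 256 /* number of lock "stripes"; a tuning parameter */

size_t find_n_on_the_fly(size_t i); /* defined elsewhere, as in the question */

int main(void)
{
    double *x = NULL, contribution = 1234.5;
    size_t xlen = 1000, i;
    omp_lock_t locks[NLOCKS];

    if ( (x = (double *) calloc(xlen, sizeof(double)) ) == NULL ) exit(-1);

    for (i = 0; i < NLOCKS; i++)
        omp_init_lock(&locks[i]);

    #pragma omp parallel for shared(x, locks) firstprivate(contribution)
    for (i = 0; i < xlen; i++) {
        size_t n = find_n_on_the_fly(i);

        /* lock only the stripe that owns element n */
        omp_set_lock(&locks[n % NLOCKS]);
        x[n] += contribution;
        omp_unset_lock(&locks[n % NLOCKS]);
    }

    for (i = 0; i < NLOCKS; i++)
        omp_destroy_lock(&locks[i]);

    free(x);
    return 0;
}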

2 Answers

1

To me, the above code seems efficient.

However, you have two options to improve it:

1- Use a memory allocation function called `_mm_malloc`, which takes the cache line size as one of its inputs, instead of the `calloc` that you used. The biggest thing you are facing right now is False Sharing. By using this method, you are basically forcing the underlying library to allocate your array aligned to the cache line size, so that threads will not fight over a cache line holding two different unrelated variables. In other words, it helps prevent two independently updated variables from being allocated in the same cache line. This will boost performance in a multi-threaded program. Google `_mm_malloc` for more info; the basic usage follows. On most modern computers, the cache line size is 64 bytes.

#if defined(__INTEL_COMPILER) 
#include <malloc.h> 
#else 
#include <mm_malloc.h> 
#endif 

int CPU_Cache_line_size = 64;

int xlen = 100;
double *ptr = _mm_malloc(xlen*sizeof(double), CPU_Cache_line_size); 
/*
  do something ...
*/
_mm_free(ptr); 

The cache line size (`CPU_Cache_line_size` above) can be queried for your system in the following ways:

  • Linux command:

    getconf LEVEL1_DCACHE_LINESIZE

  • Programmatically:

    int CPU_Cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* requires <unistd.h> */

  • Detailed information regarding cache:

    /sys/devices/system/cpu/cpu0/cache/

2- You mentioned this yourself, but the memory limitations are severe in your situation: use one array per thread and then reduce them. This approach is more efficient. Consider using it again if you can.
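
A minimal sketch of approach 2, reusing the `x`, `xlen` and `find_n_on_the_fly` names from the question (each thread fills its own private copy of the array, and the copies are then reduced into the shared one):

#include <omp.h>
#include <stdlib.h>

size_t find_n_on_the_fly(size_t i); /* defined elsewhere, as in the question */

void accumulate(double *x, size_t xlen, double contribution)
{
    #pragma omp parallel
    {
        double *xpriv;
        size_t i;

        /* one private copy of the array per thread */
        if ( (xpriv = calloc(xlen, sizeof(double))) == NULL ) exit(-1);

        #pragma omp for
        for (i = 0; i < xlen; i++)
            xpriv[find_n_on_the_fly(i)] += contribution;

        /* reduce the private copies into the shared array, one thread at a time */
        #pragma omp critical
        for (i = 0; i < xlen; i++)
            x[i] += xpriv[i];

        free(xpriv);
    }
}

With OpenMP 4.5 or newer you may be able to get much the same effect with an array-section reduction, `#pragma omp parallel for reduction(+:x[0:xlen])`, but either way each thread needs its own copy of the array, which is exactly the memory limitation mentioned in the question.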

mgNobody
  • Note that `_mm_malloc` and `_mm_free` are only valid if you use Intel `icc` or `icpc` compilers. Consider using other options if you build your program using a different one. POSIX alternative for aligned allocation is `posix_memalign`: https://linux.die.net/man/3/posix_memalign . C11 standard version is `aligned_alloc` ( http://en.cppreference.com/w/c/memory/aligned_alloc). – Jorge Bellon Oct 05 '16 at 20:06
0

I'll preface this with "I'm very new to parallel computing" and this may not be viable depending on your use case.

One idea that may be useful is to separate the computational and updating aspects of the program: establish a master/control thread that has sole access to the array and performs updates, in queue fashion, from information gathered from the slave/computation threads. This can serve to minimize the need for locking in the traditional sense.
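
A very simplified sketch of that idea, reusing the names from the question (per-thread buffers of pending updates instead of a live queue, applied by a single writer afterwards; the `update_t` type is made up for the example):

#include <omp.h>
#include <stdlib.h>

size_t find_n_on_the_fly(size_t i); /* defined elsewhere, as in the question */

typedef struct { size_t n; double value; } update_t;

void accumulate_single_writer(double *x, size_t xlen, double contribution)
{
    int nthreads = omp_get_max_threads();
    update_t **buf = malloc(nthreads * sizeof *buf);   /* one buffer of pending updates per thread */
    size_t *count = calloc(nthreads, sizeof *count);
    int t;

    for (t = 0; t < nthreads; t++)
        buf[t] = malloc(xlen * sizeof **buf);          /* worst case: every iteration lands in one buffer */

    /* phase 1: worker threads only compute; they never touch x */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        size_t i;

        #pragma omp for
        for (i = 0; i < xlen; i++) {
            buf[tid][count[tid]].n = find_n_on_the_fly(i);
            buf[tid][count[tid]].value = contribution;
            count[tid]++;
        }
    }

    /* phase 2: a single writer applies all updates, so x needs no locking at all */
    for (t = 0; t < nthreads; t++) {
        size_t k;
        for (k = 0; k < count[t]; k++)
            x[buf[t][k].n] += buf[t][k].value;
        free(buf[t]);
    }

    free(buf);
    free(count);
}

Note that this trades the per-thread array copies for per-thread update buffers, so whether it actually saves memory depends on how many updates each thread generates.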

I hope that at least provides a different way of looking at the problem.

Dan