
My requirement is this: every thread allocates its own memory, then processes it:

typedef struct
{
    /* ... */
}A;

A *p[N];

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        p[i] = (A*)calloc(N, sizeof(*p[i]));
        if (NULL == p[i]) {
            return;
        }
        /* ... */
    }
}

But the compiler will complain:

error: invalid exit from OpenMP structured block
     return;

So, other than moving the memory-allocation code out of the #pragma omp parallel block:

for (int i = 0; i < N; i++) {
    p[i] = (A*)calloc(N, sizeof(*p[i]));
    if (NULL == p[i]) {
        return;
    }       
}
#pragma omp parallel
{
    #pragma omp for
    /* ... */
}

Is there any better method?

Nan Xiao
  • Do you really need to return immediately if any calloc() fails? If not, then there's no reason to do anything fancy. If so, then you'll have to implement your own abort. E.g.: http://www.thinkingparallel.com/2007/06/29/breaking-out-of-loops-in-openmp/ – David Jan 27 '17 at 03:37
  • Indeed, it does not make sense for one of the worker threads to return from inside a parallel block. Only the thread that called the function containing the parallel block can return from that function. – John Bollinger Jan 27 '17 at 04:16
  • Having memory allocation in a parallel loop might not be a good idea. I would generally assume that `calloc` & friends use locks, so you will get a lot of contention. So your second version, with the master thread allocating all memory, may actually perform better. That is, again, assuming that the central `calloc` does not mess with NUMA / first-touch policies. That is a [complex topic](http://scicomp.stackexchange.com/q/2028), but it can really be critical for performance. Intuitively I would go with the *serial allocation* and then measure (a first-touch sketch follows these comments). – Zulan Jan 27 '17 at 08:51
  • @Zulan the first version can allocate the memory locally to each core, whereas the second version allocates the memory only on the master core, so the first version could be better. – Z boson Jan 27 '17 at 09:00
  • @Zboson as I tried to indicate that depends on many factors and is not generally true. See for example *... keep in mind that the physical memory allocation on Linux happens at the first touch (and not at the malloc-point). ...* https://software.intel.com/en-us/articles/memory-allocation-and-first-touch – Zulan Jan 27 '17 at 09:04
  • @Zulan, the OP used `calloc` not `malloc`. – Z boson Jan 27 '17 at 09:05
  • @Zboson `calloc` usually doesn't touch the memory either, it just allocates zero-pages. – Zulan Jan 27 '17 at 09:07
  • @Zulan, yeah, I was not sure about that. It has to do with security and not giving people memory that could contain sensitive info, so you get cleared pages instead... – Z boson Jan 27 '17 at 09:08
  • @Zulan, here is what I was referring to http://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc – Z boson Jan 27 '17 at 09:27
  • @Zulan, okay I see your point now. In the second case the memory could be allocated locally when it is first touched by a thread. The difference is that in the first case it is guaranteed to be allocated by the thread that requests the memory, whereas in the second case there is no guarantee. – Z boson Jan 27 '17 at 10:09
  • @Zboson if you need a guarantee, you will have to employ explicit NUMA control. It's not clear though this is the case for this question as `......` is under-specified ;-) – Zulan Jan 27 '17 at 10:24
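
For what it's worth, a rough sketch of the serial-allocation-plus-first-touch idea discussed above might look like this. This is only an illustration, not code from the thread; it reuses the A, p, and N names from the question and assumes a Linux-style first-touch page placement policy.

/* Sketch only: serial allocation, then each thread "touches" the block it will
 * later work on, so its pages are faulted in on that thread's NUMA node.
 * Needs <stdlib.h> and <string.h>. */
for (int i = 0; i < N; i++) {
    p[i] = calloc(N, sizeof(*p[i]));   /* master thread allocates */
    if (NULL == p[i]) {
        return;                        /* allowed here: outside any parallel region */
    }
}

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    /* First touch: writing the zeroed block forces physical allocation locally. */
    memset(p[i], 0, N * sizeof(*p[i]));
    /* ... process p[i] ... */
}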

2 Answers


You're looking for this, I think:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        p[i] = (A*)calloc(N, sizeof(*p[i]));
        if (NULL == p[i]) {
            #pragma omp cancel for
        }
        /* ... */
    }
}

But you'll need to set the environment variable OMP_CANCELLATION to true for this to work.
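
If you want to confirm at run time that cancellation is actually enabled, a small check like the following works; this is an illustrative addition rather than part of the original answer (omp_get_cancellation() is part of the OpenMP 4.0 runtime API):

/* Minimal check: reports whether OMP_CANCELLATION was set when the program started. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("cancellation %s\n", omp_get_cancellation() ? "enabled" : "disabled");
    return 0;
}

Run it as OMP_CANCELLATION=true ./a.out; the value is read once at program start-up and cannot be changed from inside the program.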

You should try to avoid doing this, though, because cancellation is expensive.

Richard

You could try this:

omp_set_dynamic(0);     // explicitly turn off dynamic thread adjustment (declared in <omp.h>)
bool cancel = false;    // bool/false come from <stdbool.h> in C

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    p[i] = (A*)calloc(N, sizeof(*p[i]));
    if (NULL == p[i]) cancel = true;
}
if (cancel) return;
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    /* ... */
}

This could allocate the memory local to each core/node. I turned off dynamic adjustment of the number of threads and used schedule(static) to make sure each thread in the second for loop accesses the same memory it allocated in the first for loop.

I don't know whether this solution would be any better. According to this comment it could be worse. Whether or not you have a multi-socket (NUMA) system could make a big difference.
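
If you need a hard guarantee about placement on a multi-socket machine (the explicit NUMA control mentioned in the comments), one option on Linux is libnuma. The helper below is a hypothetical sketch, not part of this answer; link with -lnuma:

/* Hypothetical helper: allocate a zero-filled block on the NUMA node of the
 * calling thread's current CPU. */
#define _GNU_SOURCE
#include <numa.h>      /* numa_available, numa_node_of_cpu, numa_alloc_onnode, numa_free */
#include <sched.h>     /* sched_getcpu */
#include <stddef.h>

static void *alloc_on_local_node(size_t bytes)
{
    if (numa_available() == -1)
        return NULL;                        /* no NUMA support on this system */
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(bytes, node);  /* free later with numa_free(ptr, bytes) */
}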

Z boson