
I am currently trying to speed up a simple matrix subtraction benchmark with OpenMP on the Maestro processor, which has a NUMA architecture and is based on the Tilera Tile64 processor. The Maestro board has 49 processors arranged in a 7x7 two-dimensional array. Each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.stack.imgur.com/RG0fC.png

I am new to the idea of writing 'NUMA-aware' applications, but the main consensus from what I've read is that data locality is a big part of maximizing performance. When parallelizing code across the cores, I should keep the data being used as local as possible to the thread doing the processing.

For this matrix subtraction benchmark (C[i] = A[i] - B[i]), I thought it would be a good idea to allocate each thread its own private A, B, and C arrays with the size being the total work size divided by the number of threads. So for example if the total size of the arrays were 6000*6000 and I was trying to parallelize it across 20 threads, I would allocate private arrays with size (6000*6000)/20. Each thread would do this subtraction on its own private array and then I would gather the results back into a final array of the total size 6000*6000. For example (without the gathering of results from each thread into a final array):

int threads = 20;
int size = 6000;
int j;
uint8_t *C_final = malloc(sizeof(uint8_t)*(size*size));
#pragma omp parallel num_threads(threads) private(j)
{
     /* Each thread allocates its own private slice of the total work. */
     uint8_t *A_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
     uint8_t *B_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
     uint8_t *C_priv = malloc(sizeof(uint8_t)*((size*size)/threads));

     /* Initialize the private arrays from the thread that allocated them. */
     for(j=0; j<((size*size)/threads); j++)
       {
            A_priv[j]=100;
            B_priv[j]=omp_get_thread_num();
            C_priv[j]=0;
       }

     /* Element-wise subtraction on the thread-local slice. */
     for(j=0; j<((size*size)/threads); j++)
       {
           C_priv[j] = A_priv[j]-B_priv[j];
       }

     free(A_priv);
     free(B_priv);
     free(C_priv);
}

The initial values for the arrays are arbitrary; I just have omp_get_thread_num() in there so I get different values in C_priv from each thread. I'm currently experimenting with the board's User Dynamic Network, which provides hardware to route packets between CPUs, in order to accumulate all of the individual thread results into a final resulting array.
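In plain shared memory, the gather itself can just be a memcpy of each thread's slice into the right offset of C_final. A minimal sketch (needs <string.h>), assuming it runs inside the parallel region above, after the subtraction loop and before the frees:

     /* Gather sketch: each thread copies its private result into its
        slice of the shared C_final array. */
     int chunk = (size*size)/threads;
     memcpy(C_final + omp_get_thread_num()*chunk, C_priv,
            chunk*sizeof(uint8_t));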

I have achieved speedup doing it this way, along with pinning the threads with OMP_PROC_BIND=true, but I'm worried that accumulating the individual results into a final array may cause enough overhead to negate the speedup.
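One way to sanity-check the pinning is to print which core each thread actually lands on. A minimal sketch, assuming the board's Linux environment provides glibc's sched_getcpu():

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each OpenMP thread reports the hardware core it runs on;
           with OMP_PROC_BIND=true the mapping should stay fixed. */
        printf("thread %d on cpu %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}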

Is this a proper way to go about this type of problem? What type of techniques should I look into for getting speedup on a NUMA architecture for a problem like this that uses OpenMP?

Edit:

For clarification, this is what I originally tried and where I noticed a slower execution time than if I just ran the code serially:

     int threads = 20;
     int size = 6000;
     uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
     uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
     uint8_t *C = malloc(sizeof(uint8_t)*(size*size));

     int i;
     /* Serial initialization: under a first-touch policy, every page
        of A, B, and C ends up local to the initializing thread. */
     for(i=0; i<(size*size); i++)
     {
       A[i] = 10;
       B[i] = 5;
       C[i] = 0;
     }

     #pragma omp parallel for num_threads(threads)
     for(i=0; i<(size*size); i++)
     {
       C[i] = A[i] - B[i];
     }

After seeing that I was getting a slower execution time when using OpenMP, I tried to figure out why. It seemed as though data locality was the issue, an assumption based on what I have read about NUMA architectures.

I am having a hard time figuring out how to alleviate the bottlenecks that are slowing it down. I found some help in similar questions like this one: OpenMP: for schedule, where it talks about allocating data to each thread so that each thread works on its local data; a sketch of that idea follows.
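The idea there, as I understand it, is first-touch placement: initialize the data with the same parallel schedule that the compute loop uses, so each thread's pages end up in memory local to it. A minimal sketch using the names from the code above, assuming the OS places pages with a first-touch policy (which I haven't confirmed for the Maestro):

     /* First-touch sketch: touch each element from the thread that will
        later compute on it, so that (under a first-touch NUMA policy)
        its pages are allocated in that thread's local memory. */
     #pragma omp parallel for schedule(static) num_threads(threads)
     for(i=0; i<(size*size); i++)
     {
       A[i] = 10;
       B[i] = 5;
       C[i] = 0;
     }

     /* The same static schedule here means each thread reuses the
        pages it touched during initialization. */
     #pragma omp parallel for schedule(static) num_threads(threads)
     for(i=0; i<(size*size); i++)
     {
       C[i] = A[i] - B[i];
     }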

I just feel like it should not be difficult to get increased performance with OpenMP on something as simple as a matrix subtraction. I'm not sure how to go about figuring out what exactly the bottleneck is and how to alleviate it.

  • Have you considered using message passing (MPI) instead? With MPI you have more explicit control over memory layout and communication between processes. – simpel01 Feb 22 '17 at 04:21
  • I think you are mixing up NUMA, caches and data locality. A detailed answer to your question would be very broad **and** require extensive knowledge of the NUMA memory allocation policies on your system **and** require more details on the memory access pattern in your app. A general answer is to keep your code high level until a measurement reveals a significant performance issue. Making a general recommendation without basing it on a specific measurement result is unproductive. I am also unsure why you would even need/want to accumulate the results if data resides in shared memory anyway. – Zulan Feb 22 '17 at 11:53
  • I added an edit to my original question to show what I initially tried, which was just a simple OpenMP for loop where I saw a decrease in performance when compared to running the subtraction serially. – Shaun Holtzman Feb 22 '17 at 19:44
  • Is the performance low, or is this just premature optimization? – huseyin tugrul buyukisik Feb 22 '17 at 19:47
  • If I do a simple OpenMP for loop (edited in an example to my original question) I see worse performance than if I just ran it serially. This is not just the case with this matrix subtraction; I've seen the same thing with, for example, matrix multiplication, but I am trying to start off with something as simple as possible. When I break up the allocation into private arrays for each thread, I see increased performance, but now each thread has its own array of results rather than one accumulated result. – Shaun Holtzman Feb 22 '17 at 19:50
  • I'm not sure if this approach is adequate, and am seeking some pointers/example if possible on how to find out what the bottleneck is/could be and how to go about alleviating it. – Shaun Holtzman Feb 22 '17 at 19:52
  • @ShaunHoltzman: For NUMA, you want the RAM being used by a CPU to be "close" to that CPU. Allocating all the RAM in the same process using something as crippled as `malloc()` cannot possibly achieve that. – Brendan Feb 22 '17 at 21:23
  • I don't know about the Maestro processor or board. Can you provide a link or two describing this? – Z boson Feb 23 '17 at 08:07
  • Sure, it's kinda hard to find documents describing the Maestro. It's a radiation hardened space-grade processor based on the Tilera TILE64 processor. This is probably the best detail I can find about the Maestro: http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA600320. Most of the documentation that I use about development on the board is from datasheets for the Tilera TILE64. – Shaun Holtzman Feb 23 '17 at 17:03
  • @simpel01 I have been considering giving MPI a try. It's probably what I am going to try next if I can't figure out why OpenMP is not giving me any speedup when using a basic for loop parallelization. – Shaun Holtzman Feb 23 '17 at 17:11

1 Answer


On a quick search and scan of the TILE64 datasheet, it doesn't look like the architecture exposes performance counters the way x86 does via tools like oprofile, VTune, or xperf. Without those, you'll have to devise experiments of your own to iteratively narrow down what portion of the code is hot and why. In the absence of microarchitectural docs and tools that indicate how your code exercises the hardware, that's a bit of a reverse-engineering task.

Some ideas about where to start on that:

  1. Do some scaling experiments. Is there a knee in the curve where going over a certain problem size or number of threads has a big effect on the overall performance? Does that number hint at some clear relationship with the size of a certain level in the memory hierarchy, or a dimension of the grid of processors, or similar?
  2. Record execution times at a few points through the program. It would probably be useful to know, for example, at a high level how much time is spent on the mallocs vs. the first loop vs. the second (see the timing sketch after this list).
  3. "I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup." - this worry is also empirically testable, especially if you're working on a large enough problem size that your timer accuracy as in (2) is not an issue for isolating time taken for the gather step vs. the part that's completely parallelizable.
  4. Try a different operation, say addition or element-wise division instead of subtraction, and see if that changes the results. On many architectures, different arithmetic operations have different latency and throughput. If you looked that up and found it holds for the TILE64, making such a change and instrumenting the runtime of your second example might tell you something useful: how much of the slowdown relative to serial execution actually comes from data locality issues, and how much comes from startup or other overhead in the OpenMP runtime that matters mostly because the problem size is small, rather than from the properly parallel part running slower.
  5. You could examine the generated assembly. It seems reasonable to assume the compiler does basically the same thing in the examples you've posted, but that assumption doesn't necessarily hold as strongly as you'd want when you're chasing odd performance. Maybe there's something about the code size or layout that changes with/without OpenMP, or when moving from one parallel approach to another: use of the instruction cache, availability of reservation station or ROB entries (if the TILE64 has those things)? Who knows, until you look.
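For (2) and (3), omp_get_wtime() should be enough for coarse phase timing. A minimal sketch, with the phase boundaries as illustrative placeholders for your code:

     /* Coarse phase timing with omp_get_wtime(); replace the comments
        with the corresponding phases of your program. */
     double t0 = omp_get_wtime();
     /* ... mallocs ... */
     double t1 = omp_get_wtime();
     /* ... initialization loop ... */
     double t2 = omp_get_wtime();
     /* ... subtraction loop ... */
     double t3 = omp_get_wtime();
     /* ... gather step ... */
     double t4 = omp_get_wtime();
     printf("alloc %.3f s, init %.3f s, compute %.3f s, gather %.3f s\n",
            t1 - t0, t2 - t1, t3 - t2, t4 - t3);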
Aaron Altman