
I am trying to use the OpenMP API (or pthreads) to parallelize the following code. Its time complexity is O(n). I am wondering if it would be possible to partition the input array into X chunks (X = number of threads) and process each of them in parallel.

This is a classic algorithm problem, and I haven't seen anyone try to implement a parallelized version so far.

Important note: A simple reduction does not solve this issue, since I'm reading the array only from left to right. So it's not obvious how to parallelize...

#include <stdio.h>

/* The function assumes that there are at least two
   elements in the array.
   It returns a negative value if the array is sorted
   in decreasing order, and 0 if all elements are
   equal. */
int maxDiff(int arr[], int arr_size)
{
  int max_diff = arr[1] - arr[0];
  int min_element = arr[0];
  int i;
  for(i = 1; i < arr_size; i++)
  {       
    if(arr[i] - min_element > max_diff)                               
      max_diff = arr[i] - min_element;
    if(arr[i] < min_element)
         min_element = arr[i];                     
  }
  return max_diff;
}
  • You have to care about the direction in which the array is read; I'm reading from left to right. –  Oct 09 '14 at 18:42
  • Ok - so it's a little more subtle than max difference, you're looking for max difference between an item and a subsequent item, is that right? – Jonathan Dursi Oct 09 '14 at 19:24
  • @jonathan Yeah exactly –  Oct 09 '14 at 19:35
  • The updated version seems to do the right thing, but the additional restriction of the problem will further limit scaling. – Jonathan Dursi Oct 09 '14 at 20:16

2 Answers


Because of the data dependencies and the low computational intensity, this is unlikely to ever give you much of a speedup on multicore. Still, you can do something: within each chunk of the array, compute the local min, the local max, and the local best difference, and then combine those results across chunks. Because of that final step, this runs in O(N) + O(P²) time (P = number of threads), further limiting scalability.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <limits.h>
#include <omp.h>

void tick(struct timeval *t);
double tock(const struct timeval * const t);

unsigned int maxDiff(const int * const arr, const int arr_size)
{
  int max_diff = arr[1] - arr[0];
  int min_element = arr[0];
  int i;
  for(i = 1; i < arr_size; i++)
  {
    if(arr[i] - min_element > max_diff)
      max_diff = arr[i] - min_element;
    if(arr[i] < min_element)
         min_element = arr[i];
  }
  return max_diff;
}

unsigned int ompMaxDiff(const int * const arr, const int arr_size)
{
  int nthreads=omp_get_max_threads();
  int maxes[nthreads];
  int mins [nthreads];
  unsigned int best = 0;

  for (int i=0; i<nthreads; i++) {
    mins [i] = INT_MAX;
    maxes[i] = INT_MIN;
  }

  #pragma omp parallel num_threads(nthreads) default(none) shared(mins, maxes) reduction(max:best) 
  {
      int idx = omp_get_thread_num();
      int min = INT_MAX, max = INT_MIN;

      #pragma omp for schedule(static) 
      for(int i=0; i<arr_size; i++) {
        if (arr[i] < min) min=arr[i];
        if (arr[i] > max) max=arr[i];
        if ((arr[i] - min) > best) best = arr[i] - min;
      }

      mins [idx] = min;
      maxes[idx] = max;
  }

  for (int i=0; i<nthreads-1; i++)
    for (int j=i+1; j<nthreads; j++)
        if ((maxes[j] - mins[i]) > best) best = maxes[j]-mins[i];

  return best;
}

int main(int argc, char **argv) {
    const int nitems=1000000;
    int *data = malloc(nitems*sizeof(int));

    srand(time(NULL));
    for (int i=0; i<nitems; i++)
        data[i] = rand() % 500;    /* numbers between 0 and 500 */


    data[(nitems/2)+1] = -700;
    data[(nitems/2)]   = 700;      /* a trick! shouldn't get 1400, */
                                   /* should get <= 1200 */

    struct timeval start;
    tick(&start);
    unsigned int res = maxDiff(data, nitems);
    double restime = tock(&start);

    printf("Serial: answer = %u, time = %lf\n", res, restime);

    tick(&start);
    res = ompMaxDiff(data, nitems);
    restime = tock(&start);

    printf("OpenMP: answer = %u, time = %lf\n", res, restime);

    free(data);

    return 0;
}

void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

double tock(const struct timeval * const t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

Running on 8 cores gives:

$ gcc -fopenmp -O3 -Wall -std=c11 maxdiff.c -o maxdiff
$ ./maxdiff 
Serial: answer = 1199, time = 0.001760
OpenMP: answer = 1199, time = 0.000488
Jonathan Dursi
  • @#$##@$@#! I was just about to give your fixed and correct solution! What I would say in addition is that reductions in OpenMP are assumed to be associative and commutative. However, the OP's function is not commutative but it's still associative (e.g. like matrix multiplication). In that case you have to store the results per thread (as you have done) and then operate in serial. For operations that are not commutative it's impossible to parallelize them. – Z boson Oct 09 '14 at 20:25
  • Whoops - sorry, @Zboson :). On the other hand, there's probably significant improvements to this version that can be offered. For instance, it turns out (google "maximum sum subsequence parallel", which is a problem that this one can be transformed into) that with more computation in each parallel region you can improve the combination time from O(P2) to O(P). But for the small range of P that is likely to be helpful here, I strongly suspect the additional work isn't worth it. – Jonathan Dursi Oct 09 '14 at 20:31
  • Instead of hardcoding the number of threads I would ask how many threads there are in your parallel team and then allocate the array inside the parallel section using a single statement. See Hristo Iliev's answer here http://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic – Z boson Oct 09 '14 at 20:35
  • @Zboson : Yeah - I should really fix that, plus manually decomposing the for loop (not relying on schedule(static) to decompose the loop the-way-it-probably-will-but-no-guarantee). – Jonathan Dursi Oct 09 '14 at 20:40
  • In general I would assume that n >> p^2, so the serial part should be negligible. I mean, what's the largest p will be on a NUMA system? – Z boson Oct 09 '14 at 20:44
  • I'm not sure I follow your argument about static scheduling. Are you worried that it might not assign the chunks in increasing thread number? I asked a question about that http://stackoverflow.com/questions/18746282/openmp-schedulestatic-with-no-chunk-size-specified-chunk-size-and-order-of-as – Z boson Oct 09 '14 at 20:49
  • @Zboson - huh; I didn't realize the guarantees were so strong for no-chunk-size-specified; thanks for pointing that out, I learned something useful today. As for P, it's getting pretty easy to buy 30-40 core nodes today, and 40^2 is substantial (but likely still sub-dominant, as you point out). – Jonathan Dursi Oct 09 '14 at 21:10
  • I made a mistake in my writing. I meant to say that for operations that are not associative it's impossible to parallelize them. So associativity is necessary, whereas commutativity is only necessary if you want to use e.g. the `reduction` clause. – Z boson Oct 10 '14 at 07:12
  • BTW, here is the question/answer where I figured out the associative but not commutative problem [c-openmp-split-for-loop-in-even-chunks-static-and-join-data-at-the-end](https://stackoverflow.com/questions/18745091/c-openmp-split-for-loop-in-even-chunks-static-and-join-data-at-the-end/18748272#18748272). – Z boson Oct 10 '14 at 08:04
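Following up on the comments about asking the parallel team for its size instead of hardcoding the number of threads: here is a sketch of that variant. The name `maxDiffTeam`, the serial fallbacks, and the empty-chunk guard are my additions, not part of the answer above; like `ompMaxDiff`, this version reports 0 rather than a negative value for a strictly decreasing array.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallbacks so the sketch also compiles without -fopenmp */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Variant of ompMaxDiff that asks the running team for its size and lets
   one thread allocate the per-thread arrays inside the parallel region. */
int maxDiffTeam(const int *arr, int arr_size)
{
    int *mins = NULL, *maxes = NULL;
    int nthreads = 0;
    int best = 0;

    #pragma omp parallel shared(mins, maxes, nthreads, best)
    {
        /* `single` has an implicit barrier, so the arrays are allocated
           and visible to the whole team before any thread writes them */
        #pragma omp single
        {
            nthreads = omp_get_num_threads();
            mins  = malloc(nthreads * sizeof(int));
            maxes = malloc(nthreads * sizeof(int));
        }

        int idx = omp_get_thread_num();
        int min = INT_MAX, max = INT_MIN, local_best = 0;

        #pragma omp for schedule(static)
        for (int i = 0; i < arr_size; i++) {
            if (arr[i] < min) min = arr[i];
            if (arr[i] > max) max = arr[i];
            if (arr[i] - min > local_best) local_best = arr[i] - min;
        }

        mins[idx]  = min;
        maxes[idx] = max;

        #pragma omp critical
        if (local_best > best) best = local_best;
    }

    /* cross-chunk pairs: a later chunk's max minus an earlier chunk's min;
       skip threads that received no iterations */
    for (int i = 0; i < nthreads - 1; i++)
        for (int j = i + 1; j < nthreads; j++) {
            if (mins[i] == INT_MAX || maxes[j] == INT_MIN) continue;
            if (maxes[j] - mins[i] > best) best = maxes[j] - mins[i];
        }

    free(mins);
    free(maxes);
    return best;
}
```

The serial combination loop is unchanged from the answer, so the O(P²) tail is still there; this only removes the hardcoded thread count and the stack-allocated per-thread arrays.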

I'm not sure about OpenMP in particular, but here's an associative operator for the problem that lends itself to parallelism.

struct intermediate {
    int min_elem;
    int max_elem;
    int max_diff;
};

Prepare a list of singletons using this function.

struct intermediate singleton(int x) {
    return (struct intermediate){x, x, INT_MIN};
}

Combine two adjacent intermediates using this function (where `min` and `max` are the usual integer min/max helpers, and `a` covers the segment immediately to the left of `b`).

struct intermediate combine(struct intermediate a, struct intermediate b) {
    return (struct intermediate){min(a.min_elem, b.min_elem),
                                 max(a.max_elem, b.max_elem),
                                 max(max(a.max_diff, b.max_diff),
                                     b.max_elem - a.min_elem)};
}

One possible evaluation strategy can be drawn like so.

        C
       / \
      C   \
     / \   \
    /   \   \
   /     \   \
  C       C   \
 / \     / \   \
S   S   S   S   S
|   |   |   |   |
0   1   2   3   4

Here C means combine and S means singleton. Since combine is associative, any binary tree will do. Here's another strategy.

        C
       / \
      /   \
     /     \
    /       C
   /       / \
  C       /   C
 / \     /   / \
S   S   S   S   S
|   |   |   |   |
0   1   2   3   4
David Eisenstat
  • @Jeb11 You evaluate something like `combine(combine(singleton(a[0]), singleton(a[1])), combine(singleton(a[2]), singleton(a[3])))` for a four-element array. Since `combine` is associative, there is a lot of flexibility in the combining order. – David Eisenstat Oct 09 '14 at 14:02
  • I'm reading the array only from left to right, so your solution does not work. –  Oct 09 '14 at 18:46
  • @Jeb11 It's not the reduction that Jonathan is using. Did you try it? – David Eisenstat Oct 09 '14 at 18:52