
The Principle

I know, such a simple calculation wouldn't be worth parallelizing elaborately. It's only an example, and the mathematical operation is just a placeholder for some more interesting computation.

[Pseudo code]

var id = 0;
do {
    id = getGlobalId();                         // next index to process
    output[id] = input[id] * input[id];         // placeholder computation
} while (inRange(id) && output[id] !== 25);     // stop once a result is 25

The most special expression might be: output[id] !== 25. That means: if input has four elements (in this order): [8, 5, 2, 9], then output should be [64, 25], and the squares of 2 and 9 would never become items of output (because output[id] !== 25 is false for id = 1, where input[id] = 5).
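For clarity, here is the same loop as a minimal sequential C program (just an illustration of the pseudo code above, nothing more), run on the example input:

#include <stdio.h>

int main(void)
{
    int input[] = {8, 5, 2, 9};
    int output[4];
    int count = (int)(sizeof(input) / sizeof(input[0]));

    int id = 0;
    do {
        output[id] = input[id] * input[id];        /* the placeholder computation */
        printf("output[%d] = %d\n", id, output[id]);
        id++;
    } while (id < count && output[id - 1] != 25);  /* stop once a result is 25 */

    /* prints 64 and 25; the squares of 2 and 9 are never computed */
    return 0;
}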

If you are optimizing this piece of code, you might want to compute the square of every input[id] ahead of time (without evaluating the second while condition), but there's no guarantee that the result will be relevant later on (if the result of a previous computation was 25, the result of the current computation is uninteresting).

Generalized, I'm talking about cases where the computation result output[id] (output[id] = calculateFrom(input[id]);) may not be relevant for every id - whether the result output[id] is needed depends on the result of another computation.

My Goal

I want to execute this loop in parallel, with as much performance as possible, using OpenCL kernels and queues.

My Ideas

  • I thought: in order to parallelize such do...while loops, we should do some computations (output[id] = calculateFrom(input[id]);) simultaneously ahead of time (without knowing whether the result output[id] will be useful). And if the result of a previous computation was 25, then the result output[id] simply gets rejected.

  • Maybe we should think about the probability of hitting output[id] === 25. If that probability is high, we shouldn't do many computations ahead of time, because their results would probably get rejected. If it is very low, then I should do more computations ahead of time. (See the small sizing sketch after this list.)

  • We should listen to the current status of the processing unit. If it's already overstrained, we shouldn't do unimportant ahead-of-time computations. But if there are enough resources to process the ahead-of-time computations, why not? - Because: if the ahead-of-time computations and the previous computations (on which those ahead-of-time computations rely) are processed at the same time, then the additional ahead-of-time computations might also slow the previous computations down - (see my second question).
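To make the probability idea a bit more concrete, here is a tiny helper sketch (the name speculative_batch_size and the geometric model are just my illustration, not an existing API) that turns an estimated break probability into a number of ahead-of-time computations:

#include <stddef.h>

/* Hypothetical helper: given an estimated probability p that a single result
 * equals 25 (i.e. triggers the break), pick how many results to compute ahead
 * of time. With a geometric model the expected number of useful results is
 * about 1/p, so computing far more than that mostly produces rejects. */
static size_t speculative_batch_size(double p, size_t max_batch)
{
    if (p <= 0.0)                          /* the break "never" happens:       */
        return max_batch;                  /* compute as much as allowed       */

    double expected_useful = 1.0 / p;      /* mean of a geometric distribution */
    size_t n = (size_t)expected_useful;

    return n < max_batch ? n : max_batch;
}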

My Questions

  1. Is it wise - and will it actually be fast - to parallelize such programs?
  2. Based on which criteria should I decide whether the processing unit has enough resources to do my ahead-of-time computations? Or: how can I know whether my processing unit is too overstrained?
  3. Do you know about any other plan for parallelizing such do...whiles? Do you have any idea concerning that?

I hope it's always clear what I want to tell you. If it isn't, please comment on my question. - Thanks for your answers and your help.

fridojet
  • There is no ready answer, you have to look at your exact case yourself. If the probability of having `output[id] == 25` for each `id` is 10%, then, on average, you would have to process 9 elements, and, most likely, it is not worth parallelizing. If the probability is .001%, you will have to process 99 999 elements on average; if your array has 10 000 elements, you can compute them all at first and throw away the garbage afterwards, but if your array has 1 000 000 elements, you can process it in chunks, doing each 100 000 in parallel and using some atomic flag to watch for `25` (a host-side sketch of this chunked approach follows these comments). – aland Sep 05 '12 at 05:29
  • @aland Thanks for your graphic description. - But, as described in my question text, I don't want to make the "degree of parallelization" (respectively, the size of the work-group) depend only on the probability that the condition applies. I want to make it **depend on the current state of the processing unit** too (is it overstrained or not?). *(There's more concerning that in the question and in the comments to mfa's answer)* – fridojet Sep 05 '12 at 17:54
  • @aland So, do you have an idea about an **algorithm** which can decide **how much ahead-of-time computing** and parallelizing we should do, based on the information provided by `clGetDeviceInfo()`? - Thanks for any thoughts about that. – fridojet Sep 05 '12 at 18:00
  • @fridojet: In case of OpenCL, you usually have no dynamic control of device load. You prepare the data, launch the kernel over this data and wait for it to finish. The optimal size of a workgroup depends on the device and the algorithm. You can use the CUDA Occupancy Calculator if you use nVidia GPUs to calculate device occupancy, or do it manually (described in most OpenCL optimization guides). For a GPU you should aim at tens of thousands of threads, and keep in mind that on some systems there is a ~30 second timeout for execution of a kernel, which might give you some estimate. – aland Sep 05 '12 at 18:19
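A rough host-side sketch of the chunked approach aland describes (the function name, the kernel argument layout and the flag buffer are assumptions for illustration; all OpenCL setup and error checking are omitted):

#include <CL/cl.h>

/* Hypothetical helper implementing the chunked scheme from the comment above.
 * queue, kernel, flagBuf and the data buffers are assumed to be set up elsewhere;
 * the kernel is assumed to take the chunk offset as argument 3 and to write a
 * nonzero int into flagBuf when it encounters a 25. */
static void process_in_chunks(cl_command_queue queue, cl_kernel kernel,
                              cl_mem flagBuf, size_t totalElements)
{
    const size_t chunkSize = 100000;   /* from the comment; tune per device   */
    const size_t localSize = 64;       /* work-group size, also to be tuned   */
    cl_int stopFlag = 0;

    for (size_t offset = 0; offset < totalElements && !stopFlag; offset += chunkSize) {
        cl_uint offsetArg = (cl_uint)offset;
        size_t globalSize = chunkSize; /* assumes totalElements % chunkSize == 0 */

        clSetKernelArg(kernel, 3, sizeof(cl_uint), &offsetArg);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &globalSize, &localSize, 0, NULL, NULL);

        /* blocking read of the flag: did any work item in this chunk hit a 25? */
        clEnqueueReadBuffer(queue, flagBuf, CL_TRUE, 0, sizeof(cl_int),
                            &stopFlag, 0, NULL, NULL);
    }
}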

1 Answer


This work can easily be done in parallel if you use a single work group and take advantage of the local memory of the device. The OpenCL spec says there will be at least 16 KB of local memory available for this purpose.

Some pseudo OpenCL code:

#define groupSize 64  // work-group size; must match the local size used when enqueueing

__kernel void doWhileTest(__global int* globalInputValues, const int maxIndex){

  __local int outputBuff[groupSize];
  __local int loopBreak[groupSize];

  int localId = get_local_id(0);

  loopBreak[localId] = 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  for(int i = localId; loopBreak[0] == 0; i += groupSize){
    if(i < maxIndex){
      // do the interesting calculation on globalInputValues[i] and save the
      // result to outputBuff[localId]; condition failed? set loopBreak[localId] to 1
      // (here: the squaring example from the question)
      outputBuff[localId] = globalInputValues[i] * globalInputValues[i];
      if(outputBuff[localId] == 25){
        loopBreak[localId] = 1;
      }
    }else{
      // past the end of the data: set the loopBreak condition here as well
      loopBreak[localId] = 1;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    if(localId == 0){
      // loop through all loopBreak values (for j = 0..groupSize-1):
      // 0? copy outputBuff[j] to global memory (globalInputValues[i+j])
      // 1? set loopBreak[0] to 1 as well and break this inner loop
      for(int j = 0; j < groupSize; j++){
        if(loopBreak[j] == 0){
          globalInputValues[i + j] = outputBuff[j];  // i is the chunk base here, since localId == 0
        }else{
          loopBreak[0] = 1;
          break;
        }
      }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // optional: write some default value to the remaining global buffer,
  // or save a count of the valid outputs

}

The for loop above may look a bit strange in that the index is compared to maxIndex inside the loop, and not in the for statement. This is because all work items need to reach both barriers inside the loop, and that is not possible if some work items break out early (i.e. maxIndex is not a multiple of groupSize). Related question re: barriers

Also, I avoid global atomic writes here because they generally don't perform as well as the local shared data/scratchpad memory, especially since you read/write entire ranges of data at once.

Using the above design, it would be possible to pre-calculate some of the values without going overboard with the processing. You would only ever discard at most groupSize results, while making the loop parallel. You can also start many of these work groups at the same time, but they would each break out (or not) based on their own data, and not on a global value or break condition. Maybe keep groupSize low-ish in this case, so the groups that don't break won't spend too much time crunching. (Don't try using a global break value either. I have spin-locked my GPU trying this, and got a white screen of death.)
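For completeness, host-side launch code for a kernel like the one above could look roughly like this (a sketch only; context, queue, program and buffer setup are assumed to exist, and the local size must match the groupSize compiled into the kernel):

#include <CL/cl.h>

/* Rough host-side launch sketch (not part of the answer itself; error checking
 * omitted). One work group handles the whole loop; launching several groups,
 * as suggested above, would additionally need a per-group data offset
 * (get_group_id(0)) inside the kernel. */
void launch_do_while_test(cl_command_queue queue, cl_kernel kernel,
                          cl_mem globalInputValues, cl_int maxIndex,
                          size_t groupSize)
{
    size_t localSize  = groupSize;   /* must equal groupSize in the kernel */
    size_t globalSize = groupSize;   /* one work group */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &globalInputValues);
    clSetKernelArg(kernel, 1, sizeof(cl_int), &maxIndex);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);
    clFinish(queue);
}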

mfa
  • Thanks for your code. - But I was also asking how to find out **which `groupSize` would be ideal**. If `groupSize` is small, then there's too little parallelism; if `groupSize` is big, then I do many operations which might get rejected. --- It wouldn't be a real problem to throw away some unneeded operation results. But I'm afraid of slowing down the processing by doing **too many (maybe unneeded) operations** when the processing unit is already **overstrained**. – fridojet Sep 05 '12 at 16:40
  • So do you have an idea **how to choose the ideal `groupSize`** dynamically by analysing the information provided by `clGetDeviceInfo()`? - Thanks for any tip concerning that. – fridojet Sep 05 '12 at 16:44
  • I think it depends on what the odds are that your group will hit an early break condition. All work items will loop at least once, and if none break out of the loop they ALL loop again. My rough example above requires 2 ints of LDS per work item, so that is a possible maximum. (It could be reduced to 1 int and 1 char, though.) I have a general answer to the work-group size question here: http://stackoverflow.com/questions/10096443/what-is-the-algorithm-to-determine-optimal-work-group-size-and-number-of-workgro/10098063#10098063 And you need to benchmark the results, as always. – mfa Sep 06 '12 at 02:24
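For reference, the queries touched on in these comments might look like this on the host side (a sketch; they only give upper bounds and hints, and the sweet spot still has to be found by benchmarking, as mfa says):

#include <stdio.h>
#include <CL/cl.h>

/* Query the device and per-kernel work-group limits (error checking omitted). */
void print_work_group_limits(cl_device_id device, cl_kernel kernel)
{
    size_t deviceMaxWg, kernelMaxWg, preferredMultiple;
    cl_uint computeUnits;
    cl_ulong localMemBytes;

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(deviceMaxWg), &deviceMaxWg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(localMemBytes), &localMemBytes, NULL);

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelMaxWg), &kernelMaxWg, NULL);
    /* OpenCL 1.1+: a good work-group size is usually a multiple of this value */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);

    printf("device max work-group size: %zu (kernel limit: %zu)\n", deviceMaxWg, kernelMaxWg);
    printf("preferred work-group size multiple: %zu\n", preferredMultiple);
    printf("compute units: %u, local memory: %lu bytes\n",
           computeUnits, (unsigned long)localMemBytes);
}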