13

I want to implement the parallel prefix sum algorithm in C++. My program should take an input array x[1...N] and display the output in an array y[N]. (Note: the maximum value of N is 1000.)

So far, I have gone through many research papers and the algorithm on Wikipedia. But my program should also display the output, the steps, and the operations/instructions of each step.

I want the fastest implementation; that is, I want to minimise both the number of operations and the number of steps.

For example::

x = {1, 2, 3,  4,   5,   6,   7,  8 } - Input
y = {1, 3, 6, 10, 15, 21, 28, 36} - Output

But along with displaying the y array as output, my program should also display the operations of each step. I also referred to the thread calculate prefix sum, but could not get much help from it.

  • What's your specific problem? It seems like a very straightforward algorithm would be sufficient. – Niklas B. Apr 07 '12 at 10:23
  • @Niklas B.: I actually want my program to use the minimum number of steps and operations. For example, if N is 1000, my program should use fewer than 20 steps and fewer than 2100 operations. –  Apr 07 '12 at 10:29
  • Try to write one yourself! Just sum up the numbers in a loop. – Niklas B. Apr 07 '12 at 10:43
  • 1
    @Niklas B : he wants the "parallel" prefix sum algorithm. – Andrei Apr 07 '12 at 11:16
  • Did you implement the algorithm in the Wikipedia article on parallel prefix sum? If so, then post it here or on ideone, and we will help you with the "display the output, the steps and also the operations/instructions of each step" part. – Andrei Apr 07 '12 at 11:24
  • http://ideone.com/kGMUB this is the link; it's not even giving the correct answer consistently. The algorithm in Wikipedia is also not easy to implement. –  Apr 07 '12 at 11:47

3 Answers

30

The answer to this question is here: Parallel Prefix Sum (Scan) with CUDA, and here: Prefix Sums and Their Applications. The NVidia article provides the best possible implementation using CUDA GPUs, and the Carnegie Mellon University PDF paper explains the algorithm. I also implemented an O(n/p) prefix sum using MPI, which you can find in my GitHub repo.

This is the pseudocode for the generic algorithm (platform independent):

Example 3. The Up-Sweep (Reduce) Phase of a Work-Efficient Sum Scan Algorithm (After Blelloch 1990)

 for d = 0 to log2(n) – 1 do 
      for all k = 0 to n – 1 by 2^(d+1) in parallel do 
           x[k +  2^(d+1) – 1] = x[k +  2^d  – 1] + x[k +  2^(d+1) – 1]

Example 4. The Down-Sweep Phase of a Work-Efficient Parallel Sum Scan Algorithm (After Blelloch 1990)

 x[n – 1] = 0
 for d = log2(n) – 1 down to 0 do 
       for all k = 0 to n – 1 by 2^(d+1) in parallel do 
            t = x[k +  2^d  – 1]
            x[k +  2^d  – 1] = x[k +  2^(d+1) – 1]
            x[k +  2^(d+1) – 1] = t +  x[k +  2^(d+1) – 1]

Where x is the input data, n is the size of the input, and d indexes the level of the reduction tree (there are log2(n) levels, so d also bounds the number of parallel steps). This is a shared-memory computation model; if it used distributed memory you'd need to add communication steps to that code, as I did in the linked GitHub example.

Marcel Valdez Orozco
  • 4
    Great response, but `x[k + 2^d +1 – 1]` should be `x[k + 2^(d +1) – 1]` both here and in the CUDA article you linked to (I trust the diagram they give over the code they give -- I believe the subscript notation was mistakenly lost there). – j_random_hacker Apr 15 '13 at 16:45
  • 4
    You are correct. I checked Blelloch's article, it seems the NVidia article is incorrect. – Marcel Valdez Orozco Apr 19 '13 at 04:26
  • 1
    During the down-sweep phase of the above algorithm, assume n=10, for d=2 and k=8, the index k+2^d–1>n. It is also the case for k+2^(d+1)–1>n. This is leading to application core dump. How should we handle the case for n not in powers of 2? – Ramakrishnan Kannan Sep 04 '14 at 02:01
  • 2
    You need to fill it in with 0's so it is a power of 2. – Marcel Valdez Orozco Oct 30 '14 at 01:30
  • I have implemented a version for the case n not in powers of 2 without filling the rest with 0's. Just ask if you need it. But it uses a slightly different downsweep algorithm than the one from the GPU Gems 3 book. – DanceIgel Feb 16 '16 at 15:51
  • 3
    I kind of have a hard time understanding what `for all k = 0 to n – 1 by 2^(d+1) in parallel do` exactly does. I get the `for all k = 0 to n – 1` part, but what is meant by `by 2^(d+1)`? I know `d` comes from the outer loop, so sure, I can calculate `2^(d+1)`, but then what should I do with that, or what does the `by` keyword mean? It would be great if you could clarify that. Thanks! – wawa Mar 29 '17 at 10:15
  • The `by` keyword is the increment specification of the for loop: `for (int k = 0; k <= n-1; k += 2^(d+1))`. In other lectures it's called the "stride" of moving from one element of the shared/global array to another. At every step (d) you're "striding" across elements of the array at 2^(d+1). The overall for loop says that all threads/processors perform this strided traversal. – EdC Sep 12 '22 at 01:21
2

I implemented only the sum of all elements in an array (the up-sweep reduce part of Blelloch), not the full prefix sum, using Aparapi (https://code.google.com/p/aparapi/) in Java/OpenCL. It's available at https://github.com/klonikar/trial-aparapi/blob/master/src/trial/aparapi/Reducer.java and it's written for a general block size (called localBatchSize in the code) instead of 2. I found that a block size of 8 works best for my GPU.

While the implementation works (the sum computation is correct), it performs much worse than a sequential sum. On my Core i7 (8-core) CPU, the sequential sum takes about 12 ms for 8388608 (8M) numbers, while the parallelized execution on the GPU (NVidia Quadro K2000M with 384 cores) takes about 100 ms. I even optimized it to transfer only the final sum after the computation rather than the entire array; without this optimization, it takes 20 ms more. The implementation seems to follow the algorithm described in the answer by @marcel-valdez-orozco.

  • 3
    I also implemented other algorithms comparing OpenCL and threading against straight sequential execution, and what I found is that the Intel Core i7 CPU already applies some major improvements. You also need to look at your compilation options: if you want truly sequential execution, remove all optimizations from your compilation. Even then it is hard to beat the sequential sum; the CPU is very efficient at computing sums, and I do not know exactly which runtime optimizations make it so fast. – Marcel Valdez Orozco May 02 '15 at 03:12
-40

The following piece of code will do the job:

#include <iostream>
using std::cin;
using std::cout;
using std::endl;

void prfxSum()
{
    int *x = 0;
    int *y = 0;
    int sum = 0;
    int num = 0;
    int i = 0;

    cout << "Enter the no. of elements in input array : ";
    cin >> num;

    x = new int[num];
    y = new int[num];

    // Read the input array
    while(i < num)
    {
        cout << "Enter element " << i+1 << " : ";
        cin >> x[i++];
    }

    cout << "Output array :- " << endl;
    i = 0;
    // Running sum: y[i] = x[0] + ... + x[i]
    while(i < num)
    {
        sum += x[i];
        y[i] = sum;
        cout << y[i++] << ", ";
    }

    delete [] x;
    delete [] y;
}

The following is the output on execution:

Enter the no. of elements in input array : 8
Enter element 1 : 1
Enter element 2 : 2
Enter element 3 : 3
Enter element 4 : 4
Enter element 5 : 5
Enter element 6 : 6
Enter element 7 : 7
Enter element 8 : 8
Output array :- 
1, 3, 6, 10, 15, 21, 28, 36

You can avoid manually entering the 1000 elements of array x[] by reading them from a file instead.

paper.plane