Improving the Efficiency of Compact/Scatter in CUDA

Question

Summary:

Any ideas about how to further improve upon the basic scatter operation in CUDA? Especially if one knows it will only be used to compact a larger array into a smaller one? or why the below methods of vectorizing memory ops and shared memory didn't work? I feel like there may be something fundamental I am missing and any help would be appreciated.

EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!

EDIT 01/04/16: I realized I never wrote about my results. Unfortunately in that Parallel for All Blog post they compared the global atomic method for compact to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregrate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method as while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2%) at best for dramatically increased code complexity.

Details:

I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling it seems that the scatter op takes up the most amount of time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:

    __global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
            int myID =  blockIdx.x*blockDim.x + threadIdx.x;
            for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
                 if(flag[id]){
                    new_freq[scan_Index[id]] = freq[id];
                 }
             } 
    }

freq_Index is the number of elements in the old array. The flag array is the result from the filter. Scan_ID is the result from the prefix sum on the flag array.

Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:

    const int compact_threads = 256;
    __global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int gID =  blockIdx.x*blockDim.x + threadIdx.x; //global ID
        int tID = threadIdx.x; //thread ID within block
        __shared__ float row[4*compact_threads];
        __shared__ int start_index[1];
        __shared__ int end_index[1];
        float4 myResult;
        int st_index;
        int4 myFlag;
        int4 index;
        for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
            if(tID == 0){
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                start_index[0] = index.x;
                st_index = index.x;
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[0] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            }
            __syncthreads();
            if(tID > 0){
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                st_index = start_index[0];
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[index.x-st_index] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
                if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
            }
            __syncthreads();
            int count = end_index[0] - st_index;

            int rem = st_index & 0x3; //equivalent to modulo 4
            int offset = 0;
            if(rem){ offset = 4 - rem; }

            if(tID < offset && tID < count){
                new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
            }

            int tempID = 4*tID+offset;
            if((tempID+3) < count){
                reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
            }

            tempID = tID + offset + (count-offset)/4*4;
            if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
        }
        int id = gID + freq_Index/4 * 4; 
        if(id < freq_Index){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.

But regardless, neither method (shared memory or vectorization) together or alone improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if maybe it helped because it increased Instruction-Level-Parallelism in the calculation steps of those other functions rather than the fewer memory ops.

I am confused. When doing stream *compaction*, aren't you looking at a gather rather than a scatter operation? — njuffa, Mar 08 '15 at 08:53
@njuffa perhaps I am using the wrong terminology, but the procedure I am following for compacting an array is: filter > scan > scatter: the filter determines which elements of the array to keep, the prefix scan determines the new index of each kept element, and then one scatters the elements of the old array to the new array. I believe that this last step can also be done in a gather operation, but I thought scatter was more efficient. Is there a better way of doing the compaction? — dada_dave, Mar 08 '15 at 10:53
@njuffa http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html "The GPUs on which Horn implemented stream compaction in 2005 did not have scatter capability, so Horn instead substituted a sequence of gather steps to emulate scatter. To compact n elements required log n gather steps, and while these steps could be implemented in one fragment program, this "gather-search" operation was fairly expensive and required more memory operations. The addition of a native scatter in recent GPUs makes stream compaction considerably more efficient." — dada_dave, Mar 08 '15 at 10:59
Performance of this is data dependent. For example, the shared memory step will provide no benefit if the distance between flagged data elements in the original array is larger than the block dimension (and ignoring grid-dimension spacing due to grid-striding loop). I also think that the L2 cache, which is a write-back cache, will mitigate the behavior signfiicantly. By the time the lines are evicted, they are probably mostly coalesced going to main memory, even without the shared memory mod. You don't say what kind of device you are running on, nor provided a complete code somebody could test — Robert Crovella, Mar 08 '15 at 14:27
@cr_dave: The terminology as I know it is as follows: "gather" refers to non-contiguous loads. "scatter" refers to non-contiguous stores. "stream compaction" is an operation that converts a non-contiguous input stream into a contiguous output stream. Thus my question. For gather operations where the average distance between accesses is small, reading through the texture path, for example by use of LDG (see documentation), can often help. — njuffa, Mar 08 '15 at 14:57
@Robert Crovella Part of the problem is indeed that there is large variation amongst the blocks, some blocks, almost every elements is flagged (thus shared memory would be of little advantage). Many blocks have highly scattered elements, but very few with 0. It is for these I thought that shared memory might've gained some performance increase but it sounds like from what you describe about L2 cache performance, even for these I might not be gaining much if anything from trying to use shared memory. — dada_dave, Mar 09 '15 at 02:35
@Robert Crovella by *these I meant blocks with infrequent flagged elements would have gained the biggest speedup — dada_dave, Mar 09 '15 at 02:42
@Robert Crovella I am testing on two different GPUs - a 780M (mobile) card and K20. Unfortunately the full simulation code is quite long. I was more hoping for information about why the above might not be giving any speed up (as you did with the L2) and what other methods might be employed. One thing of course is that the shared memory allows for vectorized writes more easily, but that doesn't seem to have helped much either. In fact, a different function I re-wrote just now also got slower with vectorized writes (that's pretty much all it does). Interesting ... — dada_dave, Mar 09 '15 at 02:45
@njuffa I think I understand what you mean now - in terms of the information from arrays it is a gather op. However, in terms of threads it is a scatter - i.e. every thread loads contiguous information from flag (and could do so from freq and scan_Index), but not every thread writes to new_freq and when it is does so, the index it writes to is determined by scan_Index. Hence the usage of the term scatter - I've also seen it referred to as "pack" since it is a scatter op that accomplishes the gather more efficiently than a gather op. — dada_dave, Mar 09 '15 at 02:58

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

I found the algorithm mentioned in this poster (similar algorithm also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, can simply sum the elements as they are calculated, 2) giving the function the ability to work over more than just arrays sized as a multiple of 1024 + adding grid-strided loops, and 3) allowing each thread to load their registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for Inclusive sum for faster scans. There may be more tweaks I can make, but for now this is good.

//kernel phase 1
int myID =  blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt=0;//;//

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;
        
        bool pred;
        if(index > true_length) pred = false;
        else pred = predicate(input[index]);
        mask = __ballot(pred); 

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}

//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

//kernel phase 3
int myID =  blockIdx.x*blockDim.x + threadIdx.x;

for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
#pragma unroll
    for(int offset = 1; offset < 32; offset<<=1){
        unsigned int n = __shfl_up(cnt, offset);
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID]; 
        }
    }
}

}

EDIT: There is a newer article by a subset of the poster authors where they examine a faster variation of compact than what is written above. However, their new version is not order preserving, so not useful for myself and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.

Improving the Efficiency of Compact/Scatter in CUDA

1 Answers1