
I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.

Starting at each leaf node, the parent node is visited and its flag is atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated, and we continue with that node's parent.

Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?

kernel void internalBounds(global struct Bound * const bounds,
                           global unsigned int * const flags,
                           const global struct Node * const nodes) {
    const unsigned int n = get_global_size(0);
    const size_t D = 3;
    const size_t leaf_start = n - 1;
    size_t node_idx = leaf_start + get_global_id(0);

    do {
        node_idx = nodes[node_idx].parent;

        // Commit any bounds written in the previous iteration before touching the flag
        write_mem_fence(CLK_GLOBAL_MEM_FENCE);

        // Mark node as visited, both children initialized on second visit
        if (atomic_inc(&flags[node_idx]) < 1)
            break;

        // Load the children's bounds only after the flag has been tested
        read_mem_fence(CLK_GLOBAL_MEM_FENCE);

        const global unsigned int * child_idxs = nodes[node_idx].internal.children;
        for (size_t d = 0; d < D; d++) {
            bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
                                          bounds[child_idxs[1]].min[d]);
            bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
                                          bounds[child_idxs[1]].max[d]);
        }
    } while (node_idx != 0);
}
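
For reference, a rough sketch of the members of Bound and Node that the kernel relies on; other members are omitted and the element types are only illustrative.

// Sketch only: remaining members omitted, element types illustrative.
struct Bound {
    float min[3];
    float max[3];
};

struct Node {
    unsigned int parent;              // index of this node's parent
    struct {
        unsigned int children[2];     // indices of the two child nodes
    } internal;
};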

I am limited to OpenCL 1.2.

kai

2 Answers


No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work group when accessing global memory. There is no inter-work-group synchronization in OpenCL 1.x.

Try using a single, large work group and iterating over the data. And/or start with small trees that fit inside a single work group.
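
Something along these lines, for example (untested; it reuses the structs, flags and fence placement from your question, and assumes the leaf count n is passed in as an argument since get_global_size(0) no longer equals the number of leaves):

kernel void internalBoundsSingleGroup(global struct Bound * const bounds,
                                      global unsigned int * const flags,
                                      const global struct Node * const nodes,
                                      const unsigned int n /* number of leaves */) {
    const size_t D = 3;
    const size_t leaf_start = n - 1;

    // Launched as exactly one work group, so every work item that touches
    // flags/bounds is in the same work group and CLK_GLOBAL_MEM_FENCE applies.
    for (size_t leaf = get_local_id(0); leaf < n; leaf += get_local_size(0)) {
        size_t node_idx = leaf_start + leaf;

        do {
            node_idx = nodes[node_idx].parent;

            write_mem_fence(CLK_GLOBAL_MEM_FENCE);

            // Second visitor sees both children initialized and continues upwards
            if (atomic_inc(&flags[node_idx]) < 1)
                break;

            read_mem_fence(CLK_GLOBAL_MEM_FENCE);

            const global unsigned int * child_idxs = nodes[node_idx].internal.children;
            for (size_t d = 0; d < D; d++) {
                bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
                                              bounds[child_idxs[1]].min[d]);
                bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
                                              bounds[child_idxs[1]].max[d]);
            }
        } while (node_idx != 0);
    }
}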

mfa
  • I'm not looking for synchronization - only memory consistency. I've slightly updated the fences in my question, perhaps it makes it clearer. After data is written to bounds, a write-fence ensures this is committed before the flag is incremented & tested. Then, a read-fence ensures the bounds are loaded after the flag is tested. Therefore, anything that sees the flag > 1 should see the previous writes to bounds, no? – kai Nov 16 '16 at 17:11
  • Possibly atomic_add will do what you are trying to do. You can read the current value and increment it at the same time, and then take action based on the value you read. It is blocking too, so local/global writes are thread safe. https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomic_add.html – mfa Nov 16 '16 at 20:39

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html

mem_fence(...) only orders memory accesses for a single work item. Even if all work items have this line, they may not reach (and pass) it at the same time.

barrier(...) does synchronize all work items in a work group and makes them wait for the slowest one (with respect to the memory type given as its parameter), but it only applies to its own work group's work items (e.g. 64 or 256 for AMD/Intel, maybe 1024 for NVIDIA). An OpenCL device driver implementation may be designed to finish all wavefronts before loading new batches of wavefronts, because all global work items simply would not fit inside chip memory (e.g. 64M work items each using 1 kB of local memory would need 64 GB!); even software emulation would need hundreds or thousands of passes and drop performance to the level of a single-core CPU.

Global sync (where all work groups are synchronized) is not possible.

Just in case work item, work group and processing element get mixed up, see: OpenCL: Work items, Processing elements, NDRange

The atomic function you use there already accesses global memory, so adding work-group-scope synchronization shouldn't matter.

Also check the machine code to see whether

bounds[child_idxs[0]].min[d]

loads the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If it does, you can separate min into an independent array and access its items directly, giving it roughly 100% more usable memory bandwidth.
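
For example, the merge step could look something like this with min/max split out into flat arrays (bounds_min/bounds_max are made-up names, 3 floats per node, untested):

    // Illustrative structure-of-arrays variant: reading one component no
    // longer pulls a whole Bound struct into private memory.
    inline void mergeChildBounds(global float * const bounds_min,
                                 global float * const bounds_max,
                                 const size_t parent,
                                 const size_t left,
                                 const size_t right)
    {
        for (size_t d = 0; d < 3; d++) {
            bounds_min[parent * 3 + d] = min(bounds_min[left * 3 + d],
                                             bounds_min[right * 3 + d]);
            bounds_max[parent * 3 + d] = max(bounds_max[left * 3 + d],
                                             bounds_max[right * 3 + d]);
        }
    }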

A test on an Intel HD 400, with more than 100,000 threads:

    __kernel void fenceTest(__global float *c,
                            __global int *ctr)
    {
        int id = get_global_id(0);
        if (id < 128000)
            for (int i = 0; i < 20000; i++)
            {
                c[id] += ctr[0];
                mem_fence(CLK_GLOBAL_MEM_FENCE);
            }
        ctr[0]++;
    }

2900 ms (the c array contains garbage)

    __kernel void fenceTest(__global float *c,
                            __global int *ctr)
    {
        int id = get_global_id(0);
        if (id < 128000)
            for (int i = 0; i < 20000; i++)
            {
                c[id] += ctr[0];
            }
        ctr[0]++;
    }

500 ms (the c array again contains garbage). That is roughly 6x the speed of the fenced version (my laptop has single-channel 4 GB RAM at only 5-10 GB/s, but its iGPU's local memory reaches nearly 38 GB/s: 64 B per cycle at 600 MHz). A local-fence version takes 700 ms, so it seems the fenceless version isn't even touching cache or local memory for some iterations.

Without the loop it takes 8-9 ms, so I don't think the loop was being optimized away in these kernels.

Edit:

    int id = get_global_id(0);
    if (id == 0)
    {
        atom_inc(&ctr[0]);
        mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    c[id] += ctr[0];

behaves exactly like

    int id = get_global_id(0);
    if (id == 0)
    {
        ctr[0]++;
        mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    c[id] += ctr[0];

for this Intel iGPU device (only by chance; it shows that the changed memory is visible to "all" trailing threads, but it does not prove this always happens (the first compute unit could hiccup and the second could start first, for example), and it is not atomic once more than a single thread accesses it).

huseyin tugrul buyukisik
  • I know global *synchronization* is not possible. I am not expecting work-items to wait on the fence. The question is about memory consistency. After data is written to `bounds`, a write-fence ensures this is committed before the flag is incremented & tested. Then, a read-fence ensures the bounds are loaded after the flag is tested. Therefore, anything that sees the `flag > 1` should see the previous writes to `bounds`, no? – kai Nov 16 '16 at 17:02
  • @kai it flushes memory operations, as that is its purpose, so just after that no pending memory operation exists for that work item, which should let other work items access that memory cell without any issue – huseyin tugrul buyukisik Nov 16 '16 at 17:03
  • I accidentally pressed save in the middle of writing my comment - exactly, if the memory has been flushed according to the fences, it should be consistent even between work-groups (if and only if the test on the atomic passed!) – kai Nov 16 '16 at 17:04
  • Again, this is just my understanding so if you can explain what I'm missing that would be great. – kai Nov 16 '16 at 17:05
  • @kai but you can't know which work group is spawned first and which finishes first. My logic says it should increase gradually, but there is a possibility that the 3rd work group is spawned before the 2nd, or at least issued before it. But all work groups "started" after that mem_fence by that work item should be able to see that memory cell. – huseyin tugrul buyukisik Nov 16 '16 at 17:06
  • I don't understand how the order in which work-groups spawn will affect it? If I visit a node twice while traversing my tree from the leaves to the root, I must have come once from the left child, and once from the right child. I have one work-item per leaf node. – kai Nov 16 '16 at 17:13
  • If you flush to global memory, it has to go to global memory, not just cache. Others must be able to see it. Even if it is in cache, others can still see it because they all use the same cache (L2). Just like locking on a memory object in C#; that's why it makes the program slower. Only a local memory fence should be unable to do this. I'm trying some code to test it. – huseyin tugrul buyukisik Nov 16 '16 at 17:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128295/discussion-between-huseyin-tugrul-buyukisik-and-kai). – huseyin tugrul buyukisik Nov 16 '16 at 20:44