A different approach to breaking up the problem is to split the universe into smaller partitions of 1024 elements each, and compute just the sizes of the intersections within each partition of the universe.
I'm not sure if I've described that well; basically you're computing

```
A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)
```

where `v` and `w` are the two lists of sets, and `k in S` is `1` if true and `0` otherwise. The point is to permute the indices so that `k` is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive `k` at once, rather than one at a time.
That is, you initialize the output matrix to all zeroes, and then, for each block of 1024 universe elements, you compute the partial intersection sizes within that block and accumulate them into the matrix.
I chose 1024 because I imagine you'll have a data layout where that's probably the smallest size at which you can still get full memory bandwidth when reading from device memory, with all of the threads in a warp working together. (Adjust this appropriately if you know better than me, or if you aren't using NVIDIA and some other size works better on whatever GPUs you're using.)
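To make the blocked loop order concrete, here's a minimal host-side reference sketch. It assumes each set is stored as a bitset over the universe and uses a GCC/Clang-style `__builtin_popcountll` for the word-level counts; the layout, the names like `WORDS_PER_BLOCK`, and the signature are all illustrative, not something the problem dictates:

```cuda
// Host-side reference: the universe chunk (blk) is the outermost loop, and
// each chunk's partial intersection sizes are accumulated into A.
// Layout assumption: set s stores its bits as n_blocks * WORDS_PER_BLOCK
// consecutive 64-bit words, chunk by chunk.
#include <cstddef>
#include <cstdint>

constexpr int WORDS_PER_BLOCK = 1024 / 64;  // one 1024-element chunk = 16 words

void intersection_sizes(const uint64_t* v_bits, const uint64_t* w_bits,
                        int* A, int n_v, int n_w, int n_blocks) {
    // A is n_v x n_w and assumed zero-initialized by the caller.
    for (int blk = 0; blk < n_blocks; ++blk) {
        for (int i = 0; i < n_v; ++i) {
            const uint64_t* a = v_bits + ((size_t)i * n_blocks + blk) * WORDS_PER_BLOCK;
            for (int j = 0; j < n_w; ++j) {
                const uint64_t* b = w_bits + ((size_t)j * n_blocks + blk) * WORDS_PER_BLOCK;
                int partial = 0;
                for (int t = 0; t < WORDS_PER_BLOCK; ++t)
                    partial += __builtin_popcountll(a[t] & b[t]);  // elements shared in this chunk
                A[(size_t)i * n_w + j] += partial;  // accumulate into the output matrix
            }
        }
    }
}
```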
Now that your elements are a reasonable size, you can appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:
Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.
You could have all of the warps operate independently, but it may be better to do the following:
- All of the warps in a block work together to load some number of elements from the first vector
- Each warp computes the intersections it can and writes the results to the output matrix
You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute the intersections with the set elements it's holding onto, but after doing so the warps can all rotate which elements each one holds.
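Here's a hedged sketch of the shared-memory variant of that scheme, oriented so each warp pins a chunk from the second list in registers while the whole block cooperatively stages chunks of the first list in shared memory, per the two bullets above. It reuses the earlier bitset layout but with 32-bit words so each lane of a warp owns exactly one word of a 1024-bit chunk, and it simplifies to one set from the second list per warp rather than a large number; all names and tile sizes are illustrative:

```cuda
// Sketch of the cooperative scheme: each warp pins one w-chunk in registers,
// the whole block stages a tile of v-chunks in shared memory, and every warp
// computes popcount intersections against the staged tile.
#include <cstdint>

constexpr int LANES  = 32;  // threads per warp; one 32-bit word per lane = 1024 bits
constexpr int WARPS  = 8;   // warps per block; each warp owns one set from w
constexpr int TILE_V = 64;  // v-chunks staged in shared memory per iteration

__global__ void intersection_kernel(const uint32_t* v_bits, const uint32_t* w_bits,
                                    int* A, int n_v, int n_w, int n_blocks) {
    __shared__ uint32_t tile[TILE_V][LANES];  // one 1024-bit chunk per staged v-set

    int lane = threadIdx.x % LANES;
    int warp = threadIdx.x / LANES;
    int j = blockIdx.x * WARPS + warp;        // this warp's column of A

    for (int blk = 0; blk < n_blocks; ++blk) {  // universe chunk is the outer loop
        // Each warp keeps its own 1024-bit w-chunk in registers, one word per lane.
        uint32_t w_word = (j < n_w)
            ? w_bits[((size_t)j * n_blocks + blk) * LANES + lane] : 0u;

        for (int i0 = 0; i0 < n_v; i0 += TILE_V) {
            // All warps in the block cooperate to stage TILE_V v-chunks.
            for (int i = warp; i < TILE_V && i0 + i < n_v; i += WARPS)
                tile[i][lane] =
                    v_bits[((size_t)(i0 + i) * n_blocks + blk) * LANES + lane];
            __syncthreads();

            if (j < n_w) {
                for (int i = 0; i < TILE_V && i0 + i < n_v; ++i) {
                    int c = __popc(tile[i][lane] & w_word);  // per-lane partial count
                    // Sum the 32 per-lane counts across the warp.
                    for (int off = LANES / 2; off > 0; off /= 2)
                        c += __shfl_down_sync(0xffffffffu, c, off);
                    // Only this warp ever touches A[i0+i][j], so a plain
                    // accumulate is safe (A must start zeroed).
                    if (lane == 0) A[(size_t)(i0 + i) * n_w + j] += c;
                }
            }
            __syncthreads();  // don't overwrite the tile while others still read it
        }
    }
}
```

You'd launch this with `WARPS * LANES` threads per block and one block per `WARPS` output columns. The register-based variant would instead keep the staged chunks in each warp's registers and periodically rotate them between the warps, at the cost of messier code.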
If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).