I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.
I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:
key = sprintf('%d ', [1 2 3]); % 23 us
% key = '1 2 3 '
result = lookup_map( key ); % 32 us
This is quite time consuming though - 1 million points * 55 us = 55 seconds.
I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.
I could transfer four arrays - key1, key2, key3, result
, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.
Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?
Q: What are the bounds of the three integers? And what data is looked up?
The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.
For the purposes of this question, we can assume that the result
is an integer between 0 and the size of the data set.
Q: Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array?
>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.
>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.
It appears that my matlab doesn't support sparse arrays of 64-bit numbers.
In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:
function [key] = to_key(face)
key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1));
end
Q: From @Dennis - why not use logical indexing?
Let's test it!
% Generate a million random integers between 0 and 1000
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search = M(500000,1:3)
search =
850 910 581
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
850 910 581 726
Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take 2.5 hours to run this a million times!
We could try filtering M
after each search:
>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.
This is a little faster, but still 696x slower than using Map.
I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.