3 integer key lookup in CUDA

Question

I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.

I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:

key = sprintf('%d ', [1 2 3]);       % 23 us
% key = '1 2 3 '
result = lookup_map( key );          % 32 us

This is quite time consuming though - 1 million points * 55 us = 55 seconds.

I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.

I could transfer four arrays - key1, key2, key3, result, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.

Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?

Q: What are the bounds of the three integers? And what data is looked up?

The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.

For the purposes of this question, we can assume that the result is an integer between 0 and the size of the data set.

Q: Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array?

>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.

>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.

It appears that my matlab doesn't support sparse arrays of 64-bit numbers.

In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:

function [key] = to_key(face)
    key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1));
end

Q: From @Dennis - why not use logical indexing?

Let's test it!

% Generate a million random integers between 0 and 1000
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search =  M(500000,1:3)             
search =
         850         910         581  
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
         850         910         581         726

Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take 2.5 hours to run this a million times!

We could try filtering M after each search:

>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.

This is a little faster, but still 696x slower than using Map.

I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.

As a sidenote, I wish the NVIDIA forums were still accessible - there was a lot of useful info there. — Alex L, Oct 15 '12 at 07:56
What are the bounds of the three integers? And what data is looked up? — Skyler Saleh, Oct 15 '12 at 08:16
@SkylerSaleh Thanks for your interest - I've added some more information to my question. — Alex L, Oct 15 '12 at 08:24
Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array? — Skyler Saleh, Oct 15 '12 at 08:58
@SkylerSaleh I've updated my Q - it appears that MATLAB (at least R2011a) doesn't support 64-bit sparse matrixes. — Alex L, Oct 15 '12 at 09:26
If you find the first key it will take about 20 iterations in the worst case, however if you filter on the result of this search, the remaining 2 keys should be found a lot quicker. — Dennis Jaheruddin, Oct 17 '12 at 20:28

score 2 · Accepted Answer · edited May 23 '17 at 12:04

2

I'm guessing this question is related to your previous question about tetrahedron faces. I still suggest you should try the sparse storage and sparse matrix-vector multiplication for that purpose:

size(spA)
ans =

 1244810     1244810

tic;
vv = spA*v;
idx = find(vv);
toc;

Elapsed time is 0.106581 seconds.

This is just timing analysis, see my previous answer about how to implement it in your case. Before you move to CUDA and do complicated stuff, check out simpler options.

edited May 23 '17 at 12:04

Community

1
1

answered Oct 15 '12 at 08:28

angainor

11,760
2
36
56

Haha I've got a follower! The reason I moved away from using sparse matrices is that I go over the maximum integer size for my computer (2.1475e+009). If I've got a 75,000 tetra mesh, and therefore 75,000*4 faces, `v=sparse(ntetras*nfaces, 1)` throws the error "Sparse matrix sizes must be non-negative integers less than MAXSIZE as defined by COMPUTER" – Alex L Oct 15 '12 at 08:52
Am I correct in thinking that if `ntetras*nfaces<=2.1475e+009`, and `nfaces=4*ntetras`, then `ntetras <= sqrt(2.1475e+009/4)` ~ 2317? – Alex L Oct 15 '12 at 08:54
@AlexL Unrelated to your problem size. Matlab `sparse` has `long` indices and can handle systems with size of 2^64.. And anyway, 75000*4=300000 ... that's a small problem. Probably you have other issues. – angainor Oct 15 '12 at 08:56
I'm sorry, I think I misunderstood your previous answer. So `v` is used just for the lookup, not to store all the data? – Alex L Oct 15 '12 at 09:04
@AlexL Yes, that is the indices you search for. The sparse matrix is also not very large. O(number of tetras * number of faces per tetra). – angainor Oct 15 '12 at 09:06

score 0 · Answer 2 · edited May 13 '13 at 06:24

0

Given the attention this question has already received it feels like this answer is too simple, but why don't you just do it like this:

M=[1:6; 2:7; 3:8; 4:9]'; %Some matrix that contains key 1 2 3, corresponding value is 4
idx=M(:,1)==1&M(:,2)==2&M(:,3)==3;
M(idx,4)

This should evaluate quite fast, even if M is 1 million x 4.

edited May 13 '13 at 06:24

Shai

111,146
38
238
371

answered Oct 17 '12 at 20:47

Dennis Jaheruddin

21,208
8
66
122

Cheers for your answer - I've tested it, and it runs ~1600 times slower than my existing solution. I've updated my question with my results, if you're interested! – Alex L Oct 18 '12 at 08:14
1

The problem is that `M(:,1)==1` first creates a new array of just the first column, and then compares every value of this new array (not stopping when the result is found). This means that your method will loop 3 million times, plus have to create three new arrays! – Alex L Oct 18 '12 at 08:17
Sorry, it appears I misunderstood your question. I assumed you wanted to find 1 point out of a million. Now i would like to ask whether you want to find 1 million points (that can occur multiple times?) Or for example each point exactly once? In that case i would consider some kind of sorting. – Dennis Jaheruddin Oct 18 '12 at 10:02

3 integer key lookup in CUDA

2 Answers2