3

I have about 10^4 points in 7 dimensional space. For a certain application, I need to make ~10^6 range queries on this input to find all the points that lie inside a given range. In this application, all the queries use the same range size. What is the appropriate data structure for this problem?

kd-tree seems apt, but for 7 dimensions and small output size it is almost linear in time complexity for queries. Another solution is range trees, but it seems too complex to construct for the small number of input in this application. Also, I don't see any of these structures using the fact that the range is constant size to their advantage. For example, if this were a 1D problem, the queries would all be asking for points lying inside a range of size 10 for example, at different places along the number line.

Ashwin Nanjappa
  • 76,204
  • 83
  • 211
  • 292
  • Are the queries offline or do you need to perform them on demand? In the online case, I don't think the fact that they have the same size will give you anything. For the offline-case I'm not sure, you can at least strip one dimension off by using a sliding window. It's still probably not worthwhile – Niklas B. Apr 24 '14 at 03:00
  • @NiklasB. The queries are online, generated based on results from earlier range queries. – Ashwin Nanjappa Apr 24 '14 at 03:08
  • Oh and one more idea: If your points are not too dense, you might get away with partitioning them into buckets the size of your query. A query will then intersect at most 2^d of those buckets (which is 128 in your case). If the points are sparse enough, this might save some serious work. Of course then there's also the problem of locating the correct buckets for a query, but again depending on your data there might not be too many non-empty buckets and you can use a linear search. And of course you can look into location-based hashing, but I don't think that is very good if you want optimality – Niklas B. Apr 24 '14 at 03:10
  • @NiklasB. Thanks! That is a good idea. Samet suggests using such a uniform grid with cells of size equal to the range size in his Spatial Data Structures book. Let me check out location based hashing too. – Ashwin Nanjappa Apr 24 '14 at 03:15
  • Cover trees also are worth investigating if your data have smaller dimension than the space in which they are embedded. – David Eisenstat Apr 28 '14 at 00:01

1 Answers1

0

Well, this is a late answer, I dont know if it s usefull now. You can use Fixed Height FQTs (FHFQT or FHQT) [http://users.dcc.uchile.cl/~rbaeza/ftp/fqtrees.ps.gz] or Fixed Query Arrays (FQA) [http://www.dcc.uchile.cl/~gnavarro/ps/mtap01.pdf]. These structures work well in that kind of similarity searching. Beside that, you need to use a good pivot selection method like Incremental one to get good results. I assume that you are using the Euclidian distance and the distances histogram is a Gaussian distribution. Sorry for my english...

Andres
  • 1