3

I am working with two large datasets (coordinates of three dimensional points). The first dataset has about a million points (Data1), while the second one has fifty million points (Data2). I need to do pair count and range queries by comparing the two datasets with each other. Presently, I am using the scipy cKDTree (I've read it's fast).

from scipy.spatial import cKDTree
KDTree_Data1 = cKDTree(Data1)
KDTree_Data2 = cKDTree(Data2)
print "KDTree created"
Data2_Indx = KDTree_Data1.query_ball_tree(KDTree_Data2, r=170, p=2.0, eps=0)                          
print "Something"

When I run this script for test cases (small sizes of Data1 and Data2), it runs fine, but when I run it for the actual case (~1 mil for Data 1 and ~50 mil for Data2), I get the following output (and error message).

KDTree created
Killed: 9

I suspect the query_ball_tree requires more memory than what my computer has. I am working with a Mac with 16 GB (1867 MHz DDR3) Memory and 3.1 GHz (Intel Core i7) Processor.

Can anyone suggest an alternative to cKDTree (or query_ball_tree) that will require less memory? I will be thankful to have any helpful response.

Commoner
  • 1,678
  • 3
  • 19
  • 34
  • Hi. Did you get this sorted? I am facing the same issue. – ds_user Jul 03 '17 at 10:45
  • @ds_user: Yes, I did find the cause of the problem. The output of 'query_ball_tree' is a 'list of pairs'. In the context of the problem that I had specified, the list of pairs was too big in size for the memory of a standard computer. In the system that I was working on, a python list with a single number would occupy a memory of 80 bytes. The output of query_ball_tree contained many tens of millions of numbers, which required a memory a very big size. The computer that I was working on had a memory of 16 GB. The output of query_ball_tree surely needed a memory bigger than 16 GB. – Commoner Jul 04 '17 at 20:39
  • @ds_user: One way to sort this, is to break the input data array into smaller chunks and compare them with data of smaller chunks. One can then combine the outputs of the comparison of smaller chunks. This approach worked for me flawlessly. – Commoner Jul 04 '17 at 20:42
  • This post may be helpful: https://stackoverflow.com/a/71332351/13394817 – Ali_Sh Sep 30 '22 at 18:38

0 Answers0