
I have an n×m array, where n = 217000 and m = 3 (data from a telescope).

I need to calculate the distances between 2 points in 3D (the columns hold my x, y, z coordinates).

When I try to use the sklearn tools, the result is:

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

What tool can I use in this situation, and what is the maximum possible size for these tools?

Tonechas
Kate Huseva
  • Do you need to calculate the distance between **only 2 points** (i.e. point nr 5 and point nr 214987) or between **all points** (i.e. point nr 1 and point nr 2, then point nr 1 and point nr 3, ...)? – Hannes Ovrén Jan 18 '17 at 15:38
  • What are the array and item sizes? – martineau Jan 18 '17 at 16:13

2 Answers


What tool can I use in this situation...?

You could implement the Euclidean distance function yourself, using the approach suggested by @Saksow. Assuming that a and b are one-dimensional NumPy arrays, you could also use any of the methods proposed in this thread:

import numpy as np
np.linalg.norm(a-b)
np.sqrt(np.sum((a-b)**2))
np.sqrt(np.dot(a-b, a-b))
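
As a minimal sketch, the three expressions above can be applied to two rows picked from an n×3 coordinate array. The array `points` and its values are made up for illustration; for these rows all three expressions give 3.0.

```python
import numpy as np

# Hypothetical sample data: each row is one point (x, y, z).
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 2.0, 2.0],
                   [4.0, 6.0, 2.0]])

a, b = points[0], points[1]   # pick any two rows by index

d_norm = np.linalg.norm(a - b)           # vector 2-norm of the difference
d_sum = np.sqrt(np.sum((a - b) ** 2))    # explicit sum of squares
d_dot = np.sqrt(np.dot(a - b, a - b))    # dot product of the difference with itself

print(d_norm, d_sum, d_dot)
```

Computing the distance between a single pair of rows this way needs no extra memory beyond the two rows, whatever the value of n.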

If you wish to compute in one go the pairwise distances (not necessarily Euclidean) between all the points in your n×m array, the module scipy.spatial.distance is your friend.

Demo:

In [79]: from scipy.spatial.distance import squareform, pdist

In [80]: arr = np.asarray([[0, 0, 0],
    ...:                   [1, 0, 0],
    ...:                   [0, 2, 0],
    ...:                   [0, 0, 3]], dtype='float')
    ...: 

In [81]: squareform(pdist(arr, 'euclidean'))
Out[81]: 
array([[ 0.        ,  1.        ,  2.        ,  3.        ],
       [ 1.        ,  0.        ,  2.23606798,  3.16227766],
       [ 2.        ,  2.23606798,  0.        ,  3.60555128],
       [ 3.        ,  3.16227766,  3.60555128,  0.        ]])

In [82]: squareform(pdist(arr, 'cityblock'))
Out[82]: 
array([[ 0.,  1.,  2.,  3.],
       [ 1.,  0.,  3.,  4.],
       [ 2.,  3.,  0.,  5.],
       [ 3.,  4.,  5.,  0.]])

Notice that the number of points in the mock data array used in this toy example is n=4 and the resulting pairwise distance array has n^2=16 elements.
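
It may help to see that pdist on its own returns a condensed one-dimensional array holding only the n(n-1)/2 distances above the diagonal, and that squareform expands it into the full n×n matrix. The sketch below reuses the toy array; the index arithmetic for `k` is written out here for illustration and the variable names are my own.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.asarray([[0, 0, 0],
                  [1, 0, 0],
                  [0, 2, 0],
                  [0, 0, 3]], dtype=float)

condensed = pdist(arr, 'euclidean')   # 1-D array with n*(n-1)/2 = 6 entries for n = 4
square = squareform(condensed)        # symmetric 4 x 4 matrix with a zero diagonal

n = arr.shape[0]
print(condensed.shape, square.shape)

# Position in the condensed array of the distance between points i < j.
i, j = 1, 3
k = n * i - i * (i + 1) // 2 + (j - i - 1)
print(condensed[k], square[i, j])
```

Working with the condensed form directly roughly halves the memory needed, since each distance is stored only once.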

...and what is the maximum possible size for these tools?

If you try to apply the approach above to your data (n=217000), you get an error:

In [105]: data = np.random.random(size=(217000, 3))

In [106]: squareform(pdist(data, 'euclidean'))
Traceback (most recent call last):

  File "<ipython-input-106-fd273331a6fe>", line 1, in <module>
    squareform(pdist(data, 'euclidean'))

  File "C:\Users\CPU 2353\Anaconda2\lib\site-packages\scipy\spatial\distance.py", line 1220, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)

MemoryError

The issue is that you are running out of RAM. To perform such a computation you would need more than 350 GB! The required amount of memory results from multiplying the number of elements of the distance matrix (217000²) by the number of bytes of each element of that matrix (8), and dividing this product by the appropriate factor (1024³) to express the result in gigabytes:

In [107]: round(data.shape[0]**2 * data.dtype.itemsize / 1024.**3, 1)
Out[107]: 350.8
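
The same arithmetic can be wrapped in a small helper to check other array sizes before attempting the computation. This is only a sketch: `distance_matrix_gib` is a hypothetical name, and it assumes 8-byte (float64) elements unless told otherwise. It also covers the condensed n(n-1)/2 array that pdist allocates in the traceback above.

```python
def distance_matrix_gib(n, itemsize=8, condensed=False):
    """Approximate memory in GiB (1024**3 bytes) for pairwise distances of n points."""
    # Condensed form stores each pair once; the square form stores n * n entries.
    n_elements = n * (n - 1) // 2 if condensed else n * n
    return n_elements * itemsize / 1024 ** 3

print(round(distance_matrix_gib(217000), 1))                  # 350.8 (full n x n matrix)
print(round(distance_matrix_gib(217000, condensed=True), 1))  # 175.4 (pdist output only)
```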

So the maximum allowed size for your data is determined by the amount of available RAM (take a look at this thread for further details).
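
If the full matrix is not needed at once, one possible workaround is to compute the distances in blocks of rows with scipy.spatial.distance.cdist and reduce each block before moving on. The sketch below assumes the quantity of interest is each point's nearest-neighbour distance, which is my assumption and not something stated in the question; the function name, `chunk_size` and the random `sample` array are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_neighbor_distances(points, chunk_size=1000):
    """Distance from each point to its nearest other point, computed block by block."""
    n = points.shape[0]
    result = np.empty(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (stop - start) x n slab of the distance matrix exists at a time.
        block = cdist(points[start:stop], points)
        # Mask each point's zero distance to itself before taking the minimum.
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf
        result[start:stop] = block.min(axis=1)
    return result

rng = np.random.default_rng(0)
sample = rng.random((2000, 3))   # hypothetical stand-in for the real data
nn = nearest_neighbor_distances(sample, chunk_size=500)
print(nn.shape)
```

With n = 217000 and chunk_size = 1000, each block holds 1000 × 217000 float64 values, roughly 1.6 GiB, so the chunk size can be tuned to the RAM available. For pure neighbour queries a KD-tree such as scipy.spatial.cKDTree may be a better fit than brute-force blocks.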

Tonechas

Using only Python and the Euclidean distance formula for 3 dimensions:

import math
distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
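
A self-contained version of the same formula, wrapped in a function so it can be called on any two points. The function name and the sample coordinates are made up for illustration.

```python
import math

def euclidean_distance_3d(p, q):
    """Euclidean distance between two points given as (x, y, z) sequences."""
    (x1, y1, z1), (x2, y2, z2) = p, q
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

print(euclidean_distance_3d((0, 0, 0), (1, 2, 2)))  # 3.0
```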
Seif