What tool can I use in this situation...?
You could implement the Euclidean distance function on your own using the approach suggested by @Saksow. Assuming that a and b are one-dimensional NumPy arrays, you could also use any of the methods proposed in this thread:
import numpy as np
np.linalg.norm(a-b)
np.sqrt(np.sum((a-b)**2))
np.sqrt(np.dot(a-b, a-b))
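As a quick sanity check, the three expressions above can be compared on a pair of sample vectors. The values of a and b below are made up for illustration and are not taken from the question.

```python
import numpy as np

# Hypothetical sample vectors (not from the question)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

d_norm = np.linalg.norm(a - b)          # vector 2-norm of the difference
d_sum = np.sqrt(np.sum((a - b) ** 2))   # explicit sum of squared differences
d_dot = np.sqrt(np.dot(a - b, a - b))   # dot product of the difference with itself

# All three should give the same Euclidean distance
print(d_norm, d_sum, d_dot)
```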
If you wish to compute, in one go, the pairwise distances (not necessarily Euclidean) between all the points in your array, the module scipy.spatial.distance is your friend.
Demo:
In [79]: from scipy.spatial.distance import squareform, pdist
In [80]: arr = np.asarray([[0, 0, 0],
...: [1, 0, 0],
...: [0, 2, 0],
...: [0, 0, 3]], dtype='float')
...:
In [81]: squareform(pdist(arr, 'euclidean'))
Out[81]:
array([[ 0. , 1. , 2. , 3. ],
[ 1. , 0. , 2.23606798, 3.16227766],
[ 2. , 2.23606798, 0. , 3.60555128],
[ 3. , 3.16227766, 3.60555128, 0. ]])
In [82]: squareform(pdist(arr, 'cityblock'))
Out[82]:
array([[ 0., 1., 2., 3.],
[ 1., 0., 3., 4.],
[ 2., 3., 0., 5.],
[ 3., 4., 5., 0.]])
Notice that the mock data array used in this toy example contains 4 points, and the resulting pairwise distance array has 4 × 4 = 16 elements.
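It may help to see that pdist itself returns a condensed vector holding only the n(n-1)/2 distinct distances, and that squareform expands it into the full symmetric n × n matrix. A minimal sketch using the same toy array as in the demo:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.asarray([[0, 0, 0],
                  [1, 0, 0],
                  [0, 2, 0],
                  [0, 0, 3]], dtype='float')

condensed = pdist(arr, 'euclidean')  # one entry per unordered pair of points
full = squareform(condensed)         # symmetric matrix with a zero diagonal

n = arr.shape[0]
print(condensed.shape)  # length is n * (n - 1) // 2
print(full.shape)       # shape is (n, n)
```

If you only need the distances themselves and not the matrix layout, working with the condensed vector roughly halves the memory needed.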
...and what max possible size for this tools?
If you try to apply the approach above to your data (217,000 points in three dimensions), you get an error:
In [105]: data = np.random.random(size=(217000, 3))
In [106]: squareform(pdist(data, 'euclidean'))
Traceback (most recent call last):
File "<ipython-input-106-fd273331a6fe>", line 1, in <module>
squareform(pdist(data, 'euclidean'))
File "C:\Users\CPU 2353\Anaconda2\lib\site-packages\scipy\spatial\distance.py", line 1220, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
The issue is that you are running out of RAM. To perform this computation you would need more than 350 GB! The required amount of memory results from multiplying the number of elements of the distance matrix (217000²) by the number of bytes of each element of that matrix (8), and dividing this product by the appropriate factor (1024³) to express the result in gigabytes:
In [107]: round(data.shape[0]**2 * data.dtype.itemsize / 1024.**3, 1)
Out[107]: 350.8
So the maximum allowed size for your data is determined by the amount of available RAM (take a look at this thread for further details).
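To judge in advance whether a data set will fit, the arithmetic above can be wrapped in a small helper. This is only a rough sketch: the function name distance_matrix_gib is made up here, it assumes 8-byte floats by default, and it ignores any temporary copies made along the way.

```python
def distance_matrix_gib(n_points, itemsize=8):
    """Rough memory estimate, in units of 1024**3 bytes, for pairwise distances.

    Returns a tuple (condensed, square): the size of the vector that
    pdist allocates, with n * (n - 1) // 2 elements, and the size of the
    full n x n matrix built by squareform.
    """
    gib = 1024.0 ** 3
    condensed = n_points * (n_points - 1) // 2 * itemsize / gib
    square = n_points ** 2 * itemsize / gib
    return condensed, square

# Hypothetical helper applied to the 217,000-point case discussed above
condensed, square = distance_matrix_gib(217000)
print(round(condensed, 1), round(square, 1))
```

Comparing these figures with the RAM actually available gives a quick feasibility check before calling pdist.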