
The code:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

ids = ['1', '2', '3']
points=[(0,0), (1,1), (3,3)]
distances = pdist(np.array(points), metric='euclidean')
print(distances)
distance_matrix = squareform(distances)
print(distance_matrix)

prints:

[1.41421356 4.24264069 2.82842712]
[[0.         1.41421356 4.24264069]
 [1.41421356 0.         2.82842712]
 [4.24264069 2.82842712 0.        ]]

as expected

I want to turn this into a long format for writing in csv, as in

id1,id2,distance
1,1,0
1,2,1.41421356
1,3,4.24264069
2,1,1.41421356
2,2,0
2,3,2.82842712

etc - how should I go about it for maximum efficiency? Using pandas is an option

Mr_and_Mrs_D

2 Answers


Use the DataFrame constructor with stack:

df = pd.DataFrame(distance_matrix, index=ids, columns=ids).stack().reset_index()
df.columns=['id1','id2','distance']
print (df)
  id1 id2  distance
0   1   1  0.000000
1   1   2  1.414214
2   1   3  4.242641
3   2   1  1.414214
4   2   2  0.000000
5   2   3  2.828427
6   3   1  4.242641
7   3   2  2.828427
8   3   3  0.000000
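Since the goal is a CSV file, the long-format frame can then be written out with to_csv (a sketch; the filename distances.csv is just an example, and index=False keeps the file to exactly the three requested columns):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

ids = ['1', '2', '3']
points = [(0, 0), (1, 1), (3, 3)]
distance_matrix = squareform(pdist(np.array(points), metric='euclidean'))

# long format: one row per (id1, id2) pair
df = pd.DataFrame(distance_matrix, index=ids, columns=ids).stack().reset_index()
df.columns = ['id1', 'id2', 'distance']

# index=False drops pandas' row index so the CSV has only id1,id2,distance
df.to_csv('distances.csv', index=False)
```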

Or the DataFrame constructor with numpy.repeat, numpy.tile and ravel:

df = pd.DataFrame({'id1':np.repeat(ids, len(ids)), 
                   'id2':np.tile(ids, len(ids)),
                   'dist':distance_matrix.ravel()})
print (df)
  id1 id2      dist
0   1   1  0.000000
1   1   2  1.414214
2   1   3  4.242641
3   2   1  1.414214
4   2   2  0.000000
5   2   3  2.828427
6   3   1  4.242641
7   3   2  2.828427
8   3   3  0.000000
jezrael
  • Thanks - those index operations still escape me. Do you think I could somehow avoid the squareform (thought to be slow in general). The matrices have many elements. – Mr_and_Mrs_D May 22 '18 at 09:21
  • @Mr_and_Mrs_D - Do you think [this](https://stackoverflow.com/a/34418376) solution? – jezrael May 22 '18 at 09:23
  • Will have a look - there is also this: https://stackoverflow.com/a/35413316/281545 - but your solution will do for now - I need time to see if I need to optimize :) – Mr_and_Mrs_D May 22 '18 at 09:31
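On the question of skipping squareform entirely: a sketch that builds the long format straight from the condensed pdist output. It relies on np.triu_indices(n, k=1) producing index pairs in the same row-major (i < j) order that pdist uses, so note it yields only the upper-triangle pairs, not the full symmetric table with the zero diagonal:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

ids = np.array(['1', '2', '3'])
points = [(0, 0), (1, 1), (3, 3)]
distances = pdist(np.array(points), metric='euclidean')

# triu_indices(n, k=1) matches pdist's condensed ordering:
# (0,1), (0,2), ..., (1,2), ...
n = len(ids)
i, j = np.triu_indices(n, k=1)
df = pd.DataFrame({'id1': ids[i], 'id2': ids[j], 'distance': distances})
```

If the symmetric pairs are needed in the file anyway, they can be appended by swapping the id1/id2 columns, at which point squareform may be just as cheap.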

I would suggest using indices_merged_arr_generic_using_cp -

Helper function -

import numpy as np
import functools

# https://stackoverflow.com/a/46135435/ by @unutbu
def indices_merged_arr_generic_using_cp(arr):
    """
    Based on cartesian_product
    http://stackoverflow.com/a/11146645/190597 (senderle)
    """
    shape = arr.shape
    arrays = [np.arange(s, dtype='int') for s in shape]
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), len(broadcasted)+1
    out = np.empty(rows * cols, dtype=arr.dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    out[start:] = arr.flatten()
    return out.reshape(cols, rows).T

Usage -

In [169]: out = indices_merged_arr_generic_using_cp(distance_matrix)

In [170]: np.savetxt('out.txt', out, fmt="%i,%i,%f")

In [171]: !cat out.txt
0,0,0.000000
0,1,1.414214
0,2,4.242641
1,0,1.414214
1,1,0.000000
1,2,2.828427
2,0,4.242641
2,1,2.828427
2,2,0.000000

To get distance_matrix we can also use SciPy's cdist: cdist(points, points). There's also the eucl_dist package (disclaimer: I am its author) that contains various methods to compute euclidean distances that are much more efficient than SciPy's cdist, especially for large arrays.
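A minimal sketch of the cdist route, which returns the full n x n matrix directly with no squareform step:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([(0, 0), (1, 1), (3, 3)])

# cdist of an array against itself gives the full symmetric
# distance matrix, zero diagonal included
distance_matrix = cdist(points, points)
```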

Divakar