I am doing some text analysis, and as part of it I need a matrix of Jaro distances between all the words in a specific list (i.e. a pairwise distance matrix), like this one:
       │ CHEESE  CHORES  GEESE  GLOVES
───────┼──────────────────────────────
CHEESE │ 0       0.222   0.177  0.444
CHORES │ 0.222   0       0.422  0.333
GEESE  │ 0.177   0.422   0      0.300
GLOVES │ 0.444   0.333   0.300  0
So I tried to construct it using numpy.fromfunction. According to the documentation and examples, it passes coordinates to the function, collects the results, and builds the matrix from them.
I tried the below approach:
import numpy as np
from jellyfish import jaro_distance

def distance(i, j):
    # i and j are expected to be integer positions in feature_dict
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])

feature_dict = 'CHEESE CHORES GEESE GLOVES'.split()
distance_matrix = np.fromfunction(distance, shape=(len(feature_dict), len(feature_dict)))
Note: jaro_distance just accepts two strings and returns a float.
And I got an error:
File "<pyshell#26>", line 4, in distance
return 1 - jaro_distance(feature_dict[i], feature_dict[j])
TypeError: only integer arrays with one element can be converted to an index
I added print(i) and print(j) at the beginning of the function and found that, instead of actual coordinates, something odd is passed:
[[ 0. 0. 0. 0.]
[ 1. 1. 1. 1.]
[ 2. 2. 2. 2.]
[ 3. 3. 3. 3.]]
[[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]]
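To double-check the output above, here is a minimal probe that assumes only NumPy. The names probe and calls are made up for illustration. It records what the callback receives, and it suggests that the function is called a single time with two whole coordinate arrays rather than once per cell:

```python
import numpy as np

calls = []  # one entry per invocation of the callback

def probe(i, j):
    # Record the argument type and shapes to see what fromfunction passes in.
    calls.append((type(i).__name__, i.shape, j.shape))
    return i + j  # element-wise arithmetic works fine on whole arrays

result = np.fromfunction(probe, (4, 4))
print(calls)         # a single entry, with two (4, 4) arrays
print(result.shape)  # (4, 4)
```

An expression like i + j works because it is element-wise, whereas feature_dict[i] tries to use a whole array as a list index.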
Why? The examples on the NumPy site seem to show that just two integers are passed, nothing else.
I tried to reproduce their example exactly using a lambda function, but I get exactly the same error:
distance_matrix = np.fromfunction(lambda i, j: 1 - jaro_distance(feature_dict[i], feature_dict[j]), shape=(len(feature_dict),len(feature_dict)))
Any help is appreciated - I assume I misunderstood it somehow.
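
For reference, here is a sketch of one possible workaround, not necessarily the idiomatic one. It wraps the scalar function in np.vectorize so that it is called once per cell, and it passes dtype=int so that the coordinates are integers rather than floats. To keep the sketch self-contained, similarity is a hypothetical stand-in built on difflib (not the Jaro measure); in the real code it would be jaro_distance from jellyfish.

```python
import numpy as np
from difflib import SequenceMatcher

words = 'CHEESE CHORES GEESE GLOVES'.split()

def similarity(a, b):
    # Hypothetical stand-in for jaro_distance: two strings in, float in [0, 1] out.
    return SequenceMatcher(None, a, b).ratio()

def distance(i, j):
    # Here i and j are single integers, so indexing the list works.
    return 1 - similarity(words[i], words[j])

# np.vectorize loops over the index arrays and calls distance once per element.
# dtype=int makes fromfunction generate integer coordinates instead of floats.
pairwise = np.vectorize(distance, otypes=[float])
distance_matrix = np.fromfunction(pairwise, (len(words), len(words)), dtype=int)
print(distance_matrix.shape)  # (4, 4)
```

np.vectorize is essentially a Python-level loop, so for a small word list a plain nested list comprehension passed to np.array should do the same job.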