I am doing some text analysis, and as part of it I need a matrix of Jaro distances between all the words in a specific list (a pairwise distance matrix), like this one:

       │CHEESE CHORES GEESE  GLOVES
───────┼───────────────────────────
CHEESE │    0   0.222  0.177  0.444     
CHORES │0.222       0  0.422  0.333
GEESE  │0.177   0.422      0  0.300
GLOVES │0.444   0.333  0.300      0

So I tried to construct it using numpy.fromfunction. As I read the documentation and examples, it passes coordinates to the function, collects the results, and builds the matrix from them.

I tried the below approach:

from jellyfish import jaro_distance
import numpy as np

def distance(i, j):
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])

feature_dict = 'CHEESE CHORES GEESE GLOVES'.split()
distance_matrix = np.fromfunction(distance, shape=(len(feature_dict),len(feature_dict)))

Note: jaro_distance takes two strings and returns a float.

And I got an error:

File "<pyshell#26>", line 4, in distance
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])
TypeError: only integer arrays with one element can be converted to an index

I added print(i) and print(j) at the beginning of the function and found that, instead of scalar coordinates, two whole arrays are passed:

[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 3.  3.  3.  3.]]
[[ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]]

Why? The examples on the NumPy site seem to show that just two integers are passed, nothing else.

I tried to reproduce their example exactly using a lambda function, but I get exactly the same error:

distance_matrix = np.fromfunction(lambda i, j: 1 - jaro_distance(feature_dict[i], feature_dict[j]), shape=(len(feature_dict),len(feature_dict)))

Any help is appreciated - I assume I misunderstood it somehow.

Maksim Khaitovich
  • Could you please turn this into a [complete example](http://stackoverflow.com/help/mcve)? What is `feature_dict`? What is the call signature of `jaro_distance()`? – ali_m Apr 22 '15 at 18:49
  • It is a complete example, I believe. Feature dict is generated as provided in code in question: feature_dict = 'CHEESE CHORES GEESE GLOVES'.split() – Maksim Khaitovich Apr 22 '15 at 18:55
  • jaro_distance just takes 2 strings and returns a float. It is not my function, it is provided by jellyfish – Maksim Khaitovich Apr 22 '15 at 18:57
  • You might look at this post: http://stackoverflow.com/questions/18702105/parameters-to-numpys-fromfunction (first answer) - NumPy's `fromfunction` documentation is somewhat misleading – xnx Apr 22 '15 at 19:13
  • Thanks @xnx, it holds the needed information. – Maksim Khaitovich Apr 22 '15 at 19:44

1 Answer

As suggested by @xnx, I investigated the question and found that fromfunction does not pass coordinates one by one; it passes all of the indices at the same time, in a single call. So if the shape of the array is (2,2), numpy will not perform f(0,0), f(0,1), f(1,0), f(1,1), but will instead perform:

f([[0., 0.], [1., 1.]], [[0., 1.], [0., 1.]])
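
To see this behaviour directly, here is a minimal sketch that records what fromfunction hands to the callback. The names `probe` and `calls` are made up for illustration and are not part of NumPy.

```python
import numpy as np

calls = []  # records the arguments of every call made by np.fromfunction

def probe(i, j):
    # i and j arrive as whole index arrays, not as scalar coordinates
    calls.append((i, j))
    return i + j

result = np.fromfunction(probe, shape=(2, 2))

print(len(calls))   # how many times probe was invoked
print(calls[0][0])  # array of row indices
print(calls[0][1])  # array of column indices
```

The function is invoked a single time, and the index arrays are floats because the default dtype of fromfunction is float.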

But it looks like my specific function can be wrapped with np.vectorize, after which it produces the needed results. The code that achieves this is below:

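from jellyfish import jaro_distance
import numpy as np

def distance(i, j):
    # i and j are scalar indices here: np.vectorize calls this once per element
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])

feature_dict = 'CHEESE CHORES GEESE GLOVES'.split()

# np.vectorize wraps the scalar function so it accepts whole index arrays
funcProxy = np.vectorize(distance)

# dtype=int makes fromfunction pass integer indices; the default dtype is float,
# and indexing a list with a float raises TypeError on current NumPy versions
distance_matrix = np.fromfunction(funcProxy, shape=(len(feature_dict),len(feature_dict)), dtype=int)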

And it works fine.
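
For comparison, here is a rough sketch of the same matrix built with explicit loops and no fromfunction. To keep it runnable without jellyfish it uses a stand-in, `toy_distance`, based on difflib from the standard library. That helper is hypothetical and is not the Jaro metric; swap in 1 - jaro_distance(a, b) for real use. The sketch also assumes the distance is symmetric, so it computes only the upper triangle and mirrors it.

```python
from difflib import SequenceMatcher

import numpy as np

def toy_distance(a, b):
    # Hypothetical stand-in for 1 - jaro_distance(a, b); NOT the Jaro metric
    return 1.0 - SequenceMatcher(None, a, b).ratio()

words = 'CHEESE CHORES GEESE GLOVES'.split()
n = len(words)

# Start from zeros so the diagonal (distance of a word to itself) stays 0
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = toy_distance(words[i], words[j])
        distance_matrix[i, j] = d  # upper triangle
        distance_matrix[j, i] = d  # mirrored entry, assuming symmetry

print(distance_matrix.round(3))
```

The NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so an explicit loop like this should not be expected to be slower, and it halves the number of distance calls.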

Maksim Khaitovich