I want to compute the cosine distance between every pair of rows in a pandas DataFrame. Before computing the distance, I want to keep only the positions where both rows have values > 0 (that is, the indices where the two rows intersect). For example, given row1 [0,1,45,0,0] and row2 [4,11,2,0,0], the program should compute the cosine distance between [1,45] and [11,2] only. Here is my script, but it takes a long time to complete. Any help with simplifying the script and reducing the processing time is appreciated.

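import numpy as np
from scipy.spatial.distance import cosine

data = df.values  # df is the input DataFrame
m, k = data.shape
dist = np.zeros((m, m))
for i in range(m):
    for j in range(i, m):
        if i != j:
            vec1 = data[i, :]
            vec2 = data[j, :]
            # keep only the positions where both rows are positive
            pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
            if pairs:
                sub_list_1, sub_list_2 = map(list, zip(*pairs))
                dist[i][j] = dist[j][i] = cosine(sub_list_1, sub_list_2)
            else:
                # no shared positive positions: fall back to a distance of 1
                dist[i][j] = dist[j][i] = 1
        else:
            dist[i][j] = 0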
kitchenprinzessin

1 Answer

From the cosine docs we have the following info -

scipy.spatial.distance.cosine(u, v) : Computes the Cosine distance between 1-D arrays.

The Cosine distance between u and v is defined as

$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where u⋅v is the dot product of u and v.
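As a quick numeric check of this definition (one minus the dot product divided by the product of the Euclidean norms), here is a minimal sketch using the filtered vectors from the question's example. The variable names are illustrative only; `scipy.spatial.distance.cosine(u, v)` should agree with the manual value.

```python
import numpy as np

# Filtered vectors from the question's example
u = np.array([1.0, 45.0])
v = np.array([11.0, 2.0])

# Cosine distance computed directly from the definition
manual = 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(manual)  # roughly 0.80
```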

Using the above formula, we would have one vectorized solution using NumPy's broadcasting, like so -

import numpy as np

def self_cosine_vectorized(a):
    dots = a.dot(a.T)                    # pairwise dot products between rows
    sqrt_sums = np.sqrt((a**2).sum(1))   # Euclidean norm of each row
    cosine_dists = 1 - (dots/sqrt_sums)/sqrt_sums[:,None]
    np.fill_diagonal(cosine_dists,0)
    return cosine_dists

Thus, to get dist -

dist = self_cosine_vectorized(df.values)  

Runtime test and verification

Original approach :

def original_app(data):
    m, k = data.shape
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i!=j:
                vec1 = data[i,:]
                vec2 = data[j,:]
                pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
                if pairs:
                    sub_list_1, sub_list_2 = map(list, zip(*pairs))
                    dist[i][j] = cosine(sub_list_1, sub_list_2)
                else:
                    dist[i][j] = 1
            else:
                dist[i][j]=0 
    return dist

Timings and verification -

In [203]: data = np.random.rand(100,100)

In [204]: np.allclose(original_app(data), self_cosine_vectorized(data))
Out[204]: True

In [205]: %timeit original_app(data)
1 loops, best of 3: 813 ms per loop

In [206]: %timeit self_cosine_vectorized(data)
10000 loops, best of 3: 101 µs per loop

In [208]: 813000.0/101
Out[208]: 8049.504950495049

Crazy 8000x+ speedup there!

Divakar
  • this calculates the distance between the 'whole' 2 vectors, but does not answer my question (For example, vector1 [0,1,45,0,0] and vector2 [4,11,2,0,0]. in this case, the program will only compute cosine distance between [1,45] and [11,2]. ) – kitchenprinzessin Mar 13 '17 at 11:47
  • @kitchenprinzessin So, the code that you have listed in the question isn't working in the first place? – Divakar Mar 13 '17 at 11:48
  • it is running forever, as i have a large dataframe ( 1877 x 6516). it works with a smaller dataframe though.. – kitchenprinzessin Mar 13 '17 at 11:49
  • @kitchenprinzessin Well I meant in theory is your code giving you the correct results that is if given some minimal sample data, would your code in the question give you the correct result? – Divakar Mar 13 '17 at 11:51
  • yes, given a smaller size of dataframe, the code produces correct results.. – kitchenprinzessin Mar 13 '17 at 11:56
  • @kitchenprinzessin Well then I am not sure where is the confusion. Added a sample large test data to verify against your working loopy version. – Divakar Mar 13 '17 at 12:09
  • @kitchenprinzessin Did the updates satisfy your needs? – Divakar Mar 14 '17 at 08:48
  • sorry, your solution did not answer the question. the issue is not on computing cosine distance for whole dataframe (solutions are available [here](http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat)). As i specified in my first comment, i want to filter out zeroes from both vectors (based on their index) before computing their similarity/distance; see example above – kitchenprinzessin Mar 14 '17 at 09:50
  • @kitchenprinzessin Well in your first comment you said - `For example, vector1 [0,1,45,0,0] and vector2 [4,11,2,0,0]. in this case, the program will only compute cosine distance between [1,45] and [11,2].` So, if you are only filtering out zeros, then you should have [4,11,2] and not [11,2], right? Also, I asked you earlier twice if your loopy code is working, to which you confirmed that it does work – Divakar Mar 14 '17 at 10:07
  • "filter out zeroes from both vectors (**based on their index**)" => vector1 has values > 0 at indexes 1,2 [1,45] and the values on vector2 on these indexes are [11,2]. – kitchenprinzessin Mar 14 '17 at 10:22
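The index-based filtering described in the comments can also be vectorized. The following is a minimal sketch and not part of the original answer; the function name `masked_cosine_dists` is hypothetical. It assumes the rule from the question: for each pair of rows, keep only the columns where both rows are > 0, and use a distance of 1 when no such column exists.

```python
import numpy as np

def masked_cosine_dists(a):
    """Pairwise cosine distance over columns that are > 0 in both rows (sketch)."""
    a = np.asarray(a, dtype=float)
    mask = (a > 0).astype(float)   # 1.0 where an entry is positive, else 0.0
    pos = a * mask                 # non-positive entries zeroed out

    # dots[i, j]: dot product restricted to columns positive in both rows
    dots = pos.dot(pos.T)
    # sq[i, j]: sum of squares of row i over columns positive in both i and j
    sq = (pos ** 2).dot(mask.T)
    # product of the two restricted norms for each pair (i, j)
    denom = np.sqrt(sq * sq.T)

    dists = np.ones_like(dots)     # default of 1 when rows share no positive column
    overlap = denom > 0
    dists[overlap] = 1.0 - dots[overlap] / denom[overlap]
    np.fill_diagonal(dists, 0.0)
    return dists

# Rows 0 and 1 are the example from the question; row 2 overlaps with neither
data = np.array([[0, 1, 45, 0, 0],
                 [4, 11, 2, 0, 0],
                 [0, 0, 0, 7, 0]])
dist = masked_cosine_dists(data)
```

On all-positive input such as `np.random.rand(100, 100)` the mask is all ones, so this reduces to the plain whole-row cosine distance. The intermediate arrays are m x k and m x m, so memory use is similar to that of the unmasked version.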