I want to compute the cosine distance between every pair of rows in a pandas DataFrame. Before computing the distance, I want to keep only the positions where both vectors have values > 0 (that is, the intersection of their positive entries). For example, given row1 = [0,1,45,0,0] and row2 = [4,11,2,0,0], the program should compute the cosine distance only between [1,45] and [11,2]. Here is my script, but it takes a long time to complete. Any help simplifying the script and reducing the processing time is appreciated.
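import numpy as np
from scipy.spatial.distance import cosine  # assuming `cosine` is SciPy's

data = df.values
m, k = data.shape
dist = np.zeros((m, m))
for i in range(m):
    for j in range(i, m):
        if i != j:
            vec1 = data[i, :]
            vec2 = data[j, :]
            # keep only the positions where both rows are positive
            pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
            if pairs:
                sub_list_1, sub_list_2 = map(list, zip(*pairs))
                dist[i][j] = dist[j][i] = cosine(sub_list_1, sub_list_2)
            else:
                # no shared positive positions: treat the distance as 1
                dist[i][j] = dist[j][i] = 1
        else:
            dist[i][j] = 0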
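
For reference, the same computation can probably be expressed with a few matrix products instead of the double loop. The following is only a sketch under assumptions: the function name `masked_cosine_distances` is made up for illustration, and it assumes all values are numeric and that the m x m intermediate matrices fit in memory. Because a product x*y is counted only when both x > 0 and y > 0, zeroing out the non-positive entries first lets `P @ P.T` produce the restricted dot products, and `(P ** 2) @ B.T` produces the squared norm of each row restricted to the positions where the other row is positive.

```python
import numpy as np

def masked_cosine_distances(data):
    """Pairwise cosine distance using only positions where both rows are > 0.

    Hypothetical helper, not part of pandas, NumPy, or SciPy.
    """
    data = np.asarray(data, dtype=float)
    positive = data > 0
    P = np.where(positive, data, 0.0)   # non-positive entries zeroed out
    B = positive.astype(float)          # 1.0 where an entry is positive

    # dot[i, j]: sum of x*y over positions where both rows are positive
    dot = P @ P.T
    # sq[i, j]: squared norm of row i restricted to positions where row j is positive
    sq = (P ** 2) @ B.T
    denom = np.sqrt(sq * sq.T)

    # Pairs with no shared positive position keep distance 1, as in the loop version
    dist = np.ones_like(dot)
    shared = denom > 0
    dist[shared] = 1.0 - dot[shared] / denom[shared]

    # The loop version sets the diagonal to 0
    np.fill_diagonal(dist, 0.0)
    return dist
```

It would be called as `dist = masked_cosine_distances(df.values)`. The result should match the loop version up to floating-point rounding, but that is worth checking on a small slice of the real data before relying on it.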