
I have a set of 30 000 documents represented by vectors of floats. All vectors have 100 elements. I can find the similarity of two documents by comparing them using the cosine measure between their vectors. The problem is that it takes too much time to find the most similar documents. Is there any algorithm that can help me speed this up?

EDIT

Right now, my code just computes the cosine similarity between the first vector and all the others. It takes about 3 seconds. I would like to speed it up ;) The algorithm doesn't have to be exact, but it should give results similar to a full search.

The sum of the elements of each vector is equal to 1.

import time

start = time.time()

# allVectors holds the 30 000 document vectors; element 0 of each vector is
# apparently an id, so only vec[1:] is compared. cosine_measure() is defined elsewhere.
first = allVectors[0]
for vec in allVectors[1:]:
    cosine_measure(vec[1:], first[1:])

print str(time.time() - start)
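
For comparison, here is a minimal vectorized sketch of the same full scan using NumPy; the array `matrix` and the random data are illustrative stand-ins, not part of the original code:

import numpy as np

# Stand-in for the real data: 30 000 vectors of 100 floats each
# (in practice something like matrix = np.array([v[1:] for v in allVectors])).
matrix = np.random.rand(30000, 100)

query = matrix[0]
norms = np.linalg.norm(matrix, axis=1)

# Cosine similarity of the first document against every row in one
# matrix-vector product instead of a Python-level loop.
similarities = matrix.dot(query) / (norms * np.linalg.norm(query))

# Indices of the 10 most similar documents, skipping the query itself.
top10 = np.argsort(similarities)[::-1][1:11]

This is still an exact full scan, just without the Python loop overhead.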
latata
  • Maybe you could show us what you did? Or at least what approach you have taken. – Samy Arous Apr 16 '14 at 17:14
  • Are the vectors normalized to unit length? – Daniel Brückner Apr 16 '14 at 17:19
  • The sum of the elements is equal to 1. But I can normalize them to length 1 if that helps. – latata Apr 16 '14 at 17:26
  • 3 seconds for 3 000 000 values is pretty good I think. Using map and numpy, you might be able to win half a second. Compiling your code using pypy might get you another half second. – Samy Arous Apr 16 '14 at 17:27
  • I'm rather looking for an algorithm that is less accurate but much faster. Python is just for testing; I will implement it in another language. – latata Apr 16 '14 at 17:30

3 Answers


Would locality-sensitive hashing (LSH) help? With LSH, the hashing function maps similar items near each other with a probability of your choice. It is claimed to be especially well suited for high-dimensional similarity search / nearest neighbor search / near-duplicate detection, and it looks to me like that's exactly what you are trying to achieve.

See also How to understand Locality Sensitive Hashing?
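
To make the idea concrete, here is a minimal sketch of the random-hyperplane flavour of LSH for cosine similarity; every name and parameter below (the number of bits, the random data) is illustrative, not taken from the question:

import numpy as np

def signature(vec, planes):
    # One bit per hyperplane: 1 if vec lies on its positive side, 0 otherwise.
    return tuple((planes.dot(vec) > 0).astype(int))

# Stand-in data: 30 000 document vectors of 100 floats.
data = np.random.rand(30000, 100)

# 16 random hyperplanes define a 16-bit signature per document.
planes = np.random.randn(16, 100)

# Bucket index: signature -> list of document indices sharing that signature.
buckets = {}
for i, vec in enumerate(data):
    buckets.setdefault(signature(vec, planes), []).append(i)

# Query: only documents in the query's bucket are compared exactly,
# so some true neighbours may be missed (the accuracy/speed trade-off).
query = data[0]
candidates = buckets[signature(query, planes)]

In practice several such tables with different random planes are combined, which raises the chance that true neighbours share a bucket in at least one of them.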

Ali

There is a paper, How to Approximate the Inner-product: Fast Dynamic Algorithms for Euclidean Similarity, describing how to perform a fast approximation of the inner product. If this is not good or fast enough, I suggest building an index containing all your documents. A structure similar to a quadtree but based on a geodesic grid would probably work really well; see Indexing the Sphere with the Hierarchical Triangular Mesh.
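
The paper's algorithm is not reproduced here, but a plain random projection (a different, generic technique) gives a rough feel for how inner products can be approximated in fewer dimensions; all numbers below are illustrative:

import numpy as np

# Stand-in data: 30 000 document vectors of 100 floats.
data = np.random.rand(30000, 100)

# Project from 100 down to k dimensions with a random Gaussian matrix;
# inner products are approximately preserved (Johnson-Lindenstrauss style),
# with more error the smaller k is.
k = 32
R = np.random.randn(100, k) / np.sqrt(k)
reduced = data.dot(R)

# Approximate inner products of the first document against all others,
# now roughly 3x cheaper than in the original 100 dimensions.
approx = reduced.dot(reduced[0])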

UPDATE: I completely forgot that you are dealing with 100 dimensions. Indexing high dimensional data is notoriously hard and I am not sure how well indexing a sphere will generalize to 100 dimensions.

Daniel Brückner

If your vectors are normalized, the cosine is related to the Euclidean distance: ||a - b||² = (a - b)² = ||a||² + ||b||² - 2 ||a|| ||b|| cos(t) = 1 + 1 - 2 cos(t). So you can recast your problem in terms of Euclidean nearest neighbors.

A nice approach is that of k-d trees, a spatial data structure that generalizes binary search (http://en.wikipedia.org/wiki/K-d_tree). However, k-d trees are known to be inefficient in high dimensions (your case), so the so-called best-bin-first search is preferred (http://en.wikipedia.org/wiki/Best-bin-first_search).
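
As a quick sketch of the Euclidean recast (using SciPy's exact k-d tree rather than best-bin-first, and with the high-dimension caveat above in mind); the random data is an illustrative stand-in:

import numpy as np
from scipy.spatial import cKDTree

# Stand-in data: 30 000 vectors of 100 floats, normalised to unit length so
# that Euclidean distance is monotone in the cosine (||a - b||^2 = 2 - 2 cos(t)).
data = np.random.rand(30000, 100)
data /= np.linalg.norm(data, axis=1)[:, None]

tree = cKDTree(data)

# The 11 nearest neighbours of the first document by Euclidean distance are
# also its most cosine-similar documents (the first hit is the document itself).
distances, indices = tree.query(data[0], k=11)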

  • k-d trees are not suitable for efficiently finding the nearest neighbour in high-dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N ≫ 2^k. Quote taken from Wikipedia, so for 30,000 documents k-d trees are only effective to maybe 10 dimensions. – Daniel Brückner Apr 16 '14 at 17:57
  • @Daniel: Read my post in full. –  Apr 16 '14 at 18:03
  • I wrote the comment when the answer still stated "efficient". – Daniel Brückner Apr 16 '14 at 18:16
  • Yep, best-bin-first was mentioned. –  Apr 16 '14 at 19:09