It looks like I can't use this similarity metric with the sklearn
KDTree, for example, but I need it because I am measuring similarity between word vectors. What is a fast, robust algorithm for this case? I know about Locality-Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.

2 Answers
The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points first. So you can use a KD tree to get the k nearest neighbors, but you will need to recompute what the cosine similarity is.
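With sklearn that might look roughly like the sketch below (the data, the variable names, and k are placeholders): L2-normalize the rows, query the KDTree with its default Euclidean metric, and convert the returned distances back to cosine similarities.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy word-vector matrix: one row per word (placeholder data).
X = np.random.RandomState(0).randn(1000, 50)

# L2-normalize every row so Euclidean rank order matches cosine rank order.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

tree = KDTree(X_norm)                # Euclidean metric by default

query = X_norm[:1]                   # any unit-length query vector
dist, idx = tree.query(query, k=5)   # k nearest neighbors by Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b), so recover the similarity.
cos_sim = 1.0 - dist**2 / 2.0
print(idx, cos_sim)
```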
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. If done, you can then use other structures like Ball Trees to do accelerated NN with cosine similarity directly. I've implemented this in the JSAT library, if you are interested in a Java implementation.
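JSAT is Java; as a rough Python analogue of the same idea (a sketch, not the JSAT code), the angular distance arccos(cos(a, b)) is a proper metric and can be passed as a user-defined callable to sklearn's ball tree. This is correct but slow, since the Python function is evaluated pairwise; the data and k below are placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def angular_distance(a, b):
    # arccos of the cosine similarity; clip to guard against rounding error.
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

X = np.random.RandomState(1).randn(500, 50)   # placeholder word vectors

nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree",
                      metric=angular_distance)
nn.fit(X)
dist, idx = nn.kneighbors(X[:1])
print(idx, np.cos(dist))                      # convert back to cosine similarity
```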

- Did you observe any drawbacks versus "standard" distance metrics? I am a little afraid that the most popular ML libraries don't use cosine similarity or a cosine-based distance for KD trees for some reason. – Brans Dec 12 '16 at 19:44
- KD trees don't naturally support cosine similarity, that's why. KD trees are only valid for a few specific distance metrics; they do not work with distance metrics in general. Other techniques, such as Ball and VP trees, can work with any valid distance metric. – Raff.Edward Dec 12 '16 at 20:19
According to the table at the end of this page, cosine support with the k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; and the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work. Indexing support in sklearn is often not very complete (albeit improving), unfortunately, so don't take that as a reference.
While the k-d-tree can theoretically support Cosine by
- transforming the data such that Cosine becomes Euclidean distance
- working with the bounding boxes and the minimum angle to the bounding box (that appears to be what ELKI is doing for the R-tree)
You should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree split always looks at a single dimension only; if you want all d dimensions to be used once, you need O(2^d) data points. For high d, there is no way all attributes get used. The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem. So in essence, don't use either for high-dimensional data.
But also don't assume that Cosine magically improves your results, in particular for high-dimensional data. It's very much overrated. As the transformation above indicates, there cannot be a systematic benefit of Cosine over Euclidean: Cosine is a special case of Euclidean.
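For unit-normalized vectors the relationship is a fixed monotone transformation (the same identity the other answer's normalization trick relies on):

$$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a^\top b = 2\bigl(1 - \cos(a, b)\bigr)$$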
For sparse data, inverted lists (cf. Lucene, Xapian, Solr, ...) are the way to index for cosine.
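A toy sketch of that idea (not how Lucene, Xapian, or Solr actually implement it; the documents, weights, and function name below are made up): for a query, only the posting lists of its non-zero terms are scanned, and the accumulated dot products are normalized into cosine scores.

```python
from collections import defaultdict
import math

# Sparse documents as {term: weight} maps (placeholder data).
docs = {
    0: {"cat": 1.0, "sat": 0.5},
    1: {"cat": 0.3, "mat": 1.2},
    2: {"dog": 0.8, "sat": 0.4},
}

# Build the inverted index: term -> list of (doc_id, weight).
index = defaultdict(list)
norms = {}
for doc_id, vec in docs.items():
    norms[doc_id] = math.sqrt(sum(w * w for w in vec.values()))
    for term, w in vec.items():
        index[term].append((doc_id, w))

def cosine_search(query, k=2):
    """Score only documents sharing at least one term with the query."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, dw in index.get(term, ()):   # walk one posting list per query term
            scores[doc_id] += qw * dw
    ranked = [(doc_id, s / (q_norm * norms[doc_id])) for doc_id, s in scores.items()]
    return sorted(ranked, key=lambda t: -t[1])[:k]

print(cosine_search({"cat": 1.0, "sat": 1.0}))
```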
