
I have a bunch of 300-dimensional vectors from a doc2vec embedding; each vector represents an article. My goal is to discard the duplicate articles. I was thinking of running DBSCAN clustering over my dataset and then, within each cluster label, discarding the duplicates. I chose DBSCAN because it's density-based (I don't have to specify the number of clusters up front), and it can take cosine distance as the metric, which I think makes more sense than euclidean distance for document-similarity detection.

The issue is that if I choose cosine distance, sklearn's DBSCAN implementation doesn't allow 'ball_tree' or 'kd_tree' as the algorithm, so I'm left with 'brute', which I'm guessing has O(n^2) complexity when run over all vectors.
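Here's a rough sketch of the clustering route I had in mind, assuming the vectors are already stacked into an (n_docs, 300) numpy array; the file name and the eps/min_samples values are just placeholders that would need tuning:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical file holding the (n_docs, 300) doc2vec vectors
vectors = np.load("doc_vectors.npy")

# metric='cosine' forces algorithm='brute' in sklearn, i.e. full pairwise distances
clustering = DBSCAN(eps=0.05, min_samples=2, metric="cosine", algorithm="brute")
labels = clustering.fit_predict(vectors)

# Keep one article per cluster; label -1 is noise, i.e. no near-duplicate was found
seen = set()
keep = []
for idx, label in enumerate(labels):
    if label == -1:
        keep.append(idx)
    elif label not in seen:
        keep.append(idx)
        seen.add(label)
```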

If I don't go the clustering route, then I see two options: calculate similarity with the most_similar method for all vectors, either in vanilla doc2vec or through an annoy index. According to this doc, vanilla doc2vec has linear complexity per similarity query, whereas annoy gives sub-linear query time. Which one would be the best choice given my use case? Is there a better approach I'm missing?
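For the annoy option, this is roughly what I'd try, assuming a trained gensim Doc2Vec model named model (model.dv in gensim 4.x, model.docvecs in older versions). The tree count, neighbour count and 0.1 cut-off are illustrative rather than tuned, and annoy's 'angular' distance is sqrt(2·(1−cos)), not the cosine distance itself:

```python
from annoy import AnnoyIndex

dims, num_trees = 300, 50
index = AnnoyIndex(dims, metric="angular")  # 'angular' is annoy's cosine-like metric
for i in range(len(model.dv)):
    index.add_item(i, model.dv[i])
index.build(num_trees)

# For each article, query its nearest neighbours in sub-linear time and
# mark anything closer than the (illustrative) threshold as a duplicate
duplicates = set()
for i in range(len(model.dv)):
    if i in duplicates:
        continue
    neighbours, dists = index.get_nns_by_item(i, 10, include_distances=True)
    for j, d in zip(neighbours, dists):
        if j != i and d < 0.1:
            duplicates.add(j)
```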

Bitswazsky
  • The best way to find out for sure which approach is best for your use case, or even practical given your data/resources/goals, is to try a bunch of alternatives & see which works best. Note that if vectors are scaled to all be of unit length, euclidean distance & cosine distance will always report the same rank-ordered nearest neighbors, so the choice may not be as stark as you suggest. When compared on some representative test set, which does better? – gojomo Jul 10 '20 at 17:20
  • @gojomo I went ahead with annoy indexes in the end. Seemed to serve my purpose well. – Bitswazsky Jul 20 '20 at 05:33

0 Answers