I have a bunch of 300-dimensional vectors produced by a doc2vec embedding, one vector per article. My goal is to discard duplicate articles. I was thinking of running DBSCAN clustering over my dataset and then, for each cluster label, discarding the duplicates. I chose DBSCAN because it's density-based (so I don't have to specify the number of clusters up front) and it accepts cosine distance as its metric, which I think makes more sense than Euclidean distance for document similarity detection.
The issue is that if I choose cosine distance, sklearn's DBSCAN implementation doesn't allow `ball_tree` or `kd_tree` as the algorithm, so I'm left with `brute`, which I'm guessing has O(n^2) complexity when run over all vectors.
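For reference, here's a minimal sketch of what I have in mind, assuming the vectors sit in a NumPy array; the file path, `eps`, and `min_samples` are placeholders I'd still have to tune:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# vectors: (n_articles, 300) doc2vec embeddings, one row per article
vectors = np.load("doc_vectors.npy")  # placeholder path

# metric='cosine' rules out ball_tree/kd_tree, so sklearn falls back to
# 'brute', i.e. a full O(n^2) pairwise distance computation
db = DBSCAN(eps=0.05, min_samples=2, metric="cosine").fit(vectors)

# label -1 means noise, i.e. no near-duplicate was found, so keep all of
# those; otherwise keep only the first article seen for each cluster label
keep, seen = [], set()
for i, label in enumerate(db.labels_):
    if label == -1 or label not in seen:
        keep.append(i)
        seen.add(label)
unique_vectors = vectors[keep]
```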
If I don't go the clustering route, I see two options: calculate similarity using the `most_similar` method for every vector, either in vanilla doc2vec or through an Annoy index. According to this doc, vanilla doc2vec has linear complexity per similarity query, whereas Annoy gives sub-linear query time. I'd like to know which of these would be the best choice given my use case. Is there a better approach I'm missing?
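This is roughly what I'm picturing for the non-clustering route with gensim (assuming gensim 4.x, where document vectors live under `model.dv`; the model path, the `doc_42` tag, `num_trees`, and the 0.95 similarity threshold are all placeholders):

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.similarities.annoy import AnnoyIndexer  # gensim >= 4.0

model = Doc2Vec.load("articles.d2v")  # placeholder path

# Option 1: vanilla most_similar -- exact, but a linear scan per query
exact = model.dv.most_similar("doc_42", topn=10)

# Option 2: the same query routed through an Annoy index -- approximate,
# but sub-linear per query once the index is built
indexer = AnnoyIndexer(model, num_trees=100)
approx = model.dv.most_similar("doc_42", topn=10, indexer=indexer)

# either way, I'd flag a neighbor as a duplicate when its similarity
# exceeds some threshold I'd have to calibrate
duplicates = [tag for tag, sim in approx if sim > 0.95]
```

Either option would mean issuing one `most_similar` query per article and merging the resulting duplicate pairs, which is what makes the per-query complexity matter for my dataset size.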