I have a niche corpus of ~12k documents, and I want to detect near-duplicates with similar meaning across it - think articles about the same event covered by different news organisations.
I have tried gensim's Word2Vec, which gives terrible similarity scores (<0.3) even when the query document is itself in the corpus, and I have tried spaCy, which flags >5k documents with similarity > 0.9. I spot-checked spaCy's most similar documents, and the results were mostly useless.
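For context, the spaCy test was essentially this (a minimal sketch, not my exact code - the model name and file loading are assumptions):

import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_md")  # assumed: medium model with word vectors
docs = [nlp(open(f).read()) for f in file_list]  # pairwise over 12k docs is slow; shown only to illustrate the scoring call
for (i, a), (j, b) in combinations(enumerate(docs), 2):
    score = a.similarity(b)  # cosine over averaged token vectors
    if score > 0.9:
        print(file_list[i], file_list[j], score)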
This is the relevant gensim code:
from gensim import models, similarities

# dictionary and corpus (bag-of-words) are built beforehand; see below
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# train LSI on the TF-IDF-weighted corpus so it matches the query transform below
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=40)

doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]]  # convert the query to LSI space

# index the LSI-projected corpus; num_features must match the LSI dimensionality
index = similarities.Similarity("pqr", lsi[corpus_tfidf], num_features=40)
sims = index[vec_lsi_tfidf]  # cosine similarity of the query against every document

# sort by similarity, highest first, and show the top 100
most_similar = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)
for doc_id, score in most_similar[:100]:
    print(doc_id, score, file_list[doc_id])
With gensim and some preprocessing I have found a decent approach, but the similarity scores are still quite low. Has anyone faced this problem, and are there resources or suggestions that could be useful?
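The preprocessing and corpus construction referenced above look roughly like this (simplified to gensim's built-in helpers; my real pipeline differs slightly):

from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(text):
    # lowercase, tokenise, and drop stopwords and very short tokens
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]

texts = [preprocess(open(f).read()) for f in file_list]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]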