Document similarity in Spacy vs Word2Vec

Question

I have a niche corpus of ~12k docs, and I want to test near-duplicate documents with similar meanings across it - think article about the same event covered by different news organisations.

I have tried gensim's Word2Vec, which gives me terrible similarity score(<0.3) even when the test document is within the corpus, and I have tried SpaCy, which gives me >5k documents with similarity > 0.9. I tested SpaCy's most similar documents, and it was mostly useless.

This is the relevant code.

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=40)
doc = preprocess(query)
vec_bow = dictionary.doc2bow(doc)
vec_lsi_tfidf = lsi[tfidf[vec_bow]] # convert the query to LSI space
index = similarities.Similarity(corpus = corpus, num_features = len(dictionary), output_prefix = "pqr")
sims = index[vec_lsi_tfidf] # perform a similarity query against the corpus
most_similar = sorted(list(enumerate(sims)), key = lambda x:x[1])

for mid in most_similar[-100:]:
    print(mid, file_list[mid[0]])

Using gensim I have found a decent approach, with some preprocessing, but the similarity score is still quite low. Has anyone faced such a problem, and are there are some resources or suggestions that could be useful?

score 14 · Accepted Answer · answered Apr 12 '18 at 17:09

I would post a comment but I don't have enough reputation! In NLP it is easy to get caught up in the methods and forget about the preprocessing.

1) Remove Stopwords/most frequent words

2) Merge word pairs - Look at SpaCy's documentation

i.e. "New York City" becomes its own unique token instead of "New", "York", "City"

https://spacy.io/usage/linguistic-features

3) Use Doc2Vec instead of Word2Vec (Since you are already using gensim, this shouldn't be too hard to figure out, they have their own implementation)

Then, once you have done all of these things, you will have document vectors, which will likely give you a better score. Also, keep in mind that the 12k docs that you have are a small amount of samples in the grand scheme of things.

Thanks for the reply. I have done 1 and 3, and 2 has just been bumped up on my todo :) — Kalpit, Apr 13 '18 at 08:20
No problem my friend! Can you accept the answer for now unless someone else comes along and gives a better answer please? — Nate Raw, Apr 13 '18 at 13:11

Document similarity in Spacy vs Word2Vec

1 Answers1

Linked