
I have a set of documents that describe different dimensions of corporate culture. Tokenized examples below:

sent1=['innovative','culture','fast','moving','company']
sent2=['manager','micromanage','all','time']
sent3=['slow','response','customer']

I've already applied GloVe and Gensim word2vec to the above documents. I'd like to identify documents that have a high cosine similarity score to a set of words, such as Innovation = ['innovate','innovative','fast']

How do I calculate the cosine similarities between each document (e.g. sent1, sent2) and Innovation using Gensim?

Ideal Output:

       innovation
sent1  0.98
sent2  0.45
sent3  -0.2
Yvonne

1 Answer


There are different approaches to computing cosine similarity between sets of documents. You can read some of the solutions here.

But if you want to calculate the cosine similarity between just two word vectors, you can do this (where a and b are your vectors):

from numpy import dot
from numpy.linalg import norm

# cosine similarity = dot product divided by the product of the norms
cos_sim = dot(a, b) / (norm(a) * norm(b))
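To score a whole document against a word set, a common approach is to average the word vectors on each side and take the cosine similarity of the two mean vectors. Here is a minimal NumPy sketch of that idea; the toy 3-dimensional vectors below are hypothetical stand-ins for what a trained Gensim/GloVe model would give you:

```python
import numpy as np

# Hypothetical word vectors standing in for a trained embedding model.
word_vectors = {
    'innovate':   np.array([0.9, 0.1, 0.0]),
    'innovative': np.array([0.8, 0.2, 0.1]),
    'fast':       np.array([0.7, 0.3, 0.0]),
    'culture':    np.array([0.2, 0.8, 0.1]),
    'company':    np.array([0.3, 0.6, 0.2]),
}

def doc_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

innovation = doc_vector(['innovate', 'innovative', 'fast'])
sent1 = doc_vector(['innovative', 'culture', 'fast', 'moving', 'company'])
score = cos_sim(sent1, innovation)
```

If you already have a loaded Gensim KeyedVectors model, its n_similarity method computes exactly this mean-vector cosine similarity between two lists of words, e.g. model.wv.n_similarity(sent1, innovation).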
Peyman