
I have a set of documents that describe different dimensions of corporate culture. Tokenized examples below:

sent1=['innovative','culture','fast','moving','company']
sent2=['manager','micromanage','all','time']
sent3=['slow','response','customer']

I've already applied GloVe and Gensim word2vec to the above documents. I'd like to identify documents that have a high cosine similarity score to a set of words, such as Innovation = ['innovate','innovative','fast']

How do I calculate the cosine similarities between each document (e.g. sent1, sent2) and Innovation using Gensim?

Ideal Output:

       innovation
sent1  0.98
sent2  0.45
sent3  -0.2
Yvonne

1 Answer


There are different approaches to computing cosine similarity between sets of documents. You can read some of the solutions here.

But if you want to calculate the cosine similarity between just two word vectors, you can do this (where a and b are your vectors):

from numpy import dot
from numpy.linalg import norm

# cosine similarity = dot product divided by the product of the norms
cos_sim = dot(a, b) / (norm(a) * norm(b))
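To score a whole document against a word set, a common approach is to average the word vectors on each side and take the cosine similarity of the two mean vectors. Here is a minimal NumPy sketch of that idea; the toy 3-dimensional vectors below are hypothetical stand-ins for what a trained Gensim/GloVe model would give you:

```python
import numpy as np

# Hypothetical word vectors standing in for a trained embedding model.
word_vectors = {
    'innovate':   np.array([0.9, 0.1, 0.0]),
    'innovative': np.array([0.8, 0.2, 0.1]),
    'fast':       np.array([0.7, 0.3, 0.0]),
    'culture':    np.array([0.2, 0.8, 0.1]),
    'company':    np.array([0.3, 0.6, 0.2]),
}

def doc_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

innovation = doc_vector(['innovate', 'innovative', 'fast'])
sent1 = doc_vector(['innovative', 'culture', 'fast', 'moving', 'company'])
score = cos_sim(sent1, innovation)
```

If you already have a loaded Gensim KeyedVectors model, its n_similarity method computes exactly this mean-vector cosine similarity between two lists of words, e.g. model.wv.n_similarity(sent1, innovation).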
Peyman