
The goal I want to achieve is to find a good word-and-phrase embedding model that can: (1) provide embeddings for the words and phrases I am interested in, and (2) let me use those embeddings to compare the similarity between any two items (each one a word or a phrase).

So far I have tried two paths:

1: Some Gensim-loaded pre-trained models, for instance:

import gensim.downloader as api

# download the pre-trained vectors and return them ready for use
model = api.load("fasttext-wiki-news-subwords-300")
model.similarity('computer-science', 'machine-learning')

The problem with this path is that I do not know in advance whether a phrase has an embedding. For this example, I got this error:

KeyError: "word 'computer-science' not in vocabulary"

I have also tried different pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, and glove-twitter-200. The results are similar: there are always phrases of interest that have no embedding.
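For reference, membership can be checked before querying, which at least avoids the KeyError (a minimal sketch; I believe Gensim's KeyedVectors support the in operator, and Gensim 4.x also exposes key_to_index):

import gensim.downloader as api

model = api.load("fasttext-wiki-news-subwords-300")

# KeyedVectors support membership testing, so we can see which tokens
# have their own vector before asking for a similarity score
for token in ['computer', 'computer-science', 'machine-learning']:
    if token in model:
        print(token, '-> in vocabulary')
    else:
        print(token, '-> OOV, similarity() would raise a KeyError')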

2: Then I tried a BERT-based sentence-embedding method: https://github.com/UKPLab/sentence-transformers.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

from scipy.spatial.distance import cosine

def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))

phrase_1 = 'baby girl'
phrase_2 = 'annual report'
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1[0], embedding_2[0])

Using this method I was able to get embeddings for my phrases, but the similarity score of 0.93 did not seem reasonable to me.
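For reference, a variant that compares the full encoded vectors rather than a single dimension would look like this (a minimal sketch; I believe encode() on a single string returns one 1-D vector, so no extra indexing is needed):

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# encode() on a single string returns one 1-D vector,
# so the whole vector is compared, not just its first component
emb_1 = model.encode('baby girl')
emb_2 = model.encode('annual report')
print('Cosine similarity: {:.2f}'.format(1 - cosine(emb_1, emb_2)))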

So what else can I try to achieve the two goals mentioned above?

Trent
  • Why is this score not reasonable? With tokenization, words are split into subwords that map to embeddings, which effectively resolves one of your problems. To embed the whole sentence, you could compute an average of the word embeddings. You may want to read this article: https://engineering.talkdesk.com/what-are-sentence-embeddings-and-why-are-they-useful-53ed370b3f35 – Alexy Sep 11 '20 at 09:11
  • Because 0.93 should mean the phrases are very similar, and I do not see how 'baby girl' and 'annual report' are similar. – Trent Sep 11 '20 at 09:14
  • How do you assess that 0.93 is not reasonable? Rather than looking at the raw similarity value, I think you should evaluate the embeddings on some end task, either the one you have in mind or a classification / matching dataset, using only the embeddings provided by your embedder. – Mathieu Sep 11 '20 at 09:14
  • Similarity comparison is pretty much my end task. Perhaps I should not have considered path 2 a good option? – Trent Sep 11 '20 at 09:17
  • Then I'd suggest scraping many pairs (sentences or words) from the internet that should be similar or dissimilar and then checking the distribution of the similarity outputs. Maybe matching pairs are 0.99 similar and 0.93 is actually discriminative. – Mathieu Sep 11 '20 at 09:31
  • A note for path 1: you might want to check whether a tokenizer is provided with your model; 'computer-science' was probably split into 2 words rather than 1. – Mathieu Sep 11 '20 at 09:34
  • Why are you using [0] with embeddings when calling the cosine_similarity function? Can you try removing that? You may be just picking one dimension of the embedding vector instead of all the dimensions. – Adnan S Sep 14 '20 at 05:05

1 Answer


The problem with the first path is that you are loading fastText embeddings as if they were word2vec embeddings, and word2vec can't cope with out-of-vocabulary (OOV) words.

The good news is that fastText itself can handle OOV words. You can use Facebook's original implementation (pip install fasttext) or the Gensim implementation.

For example, using Facebook's implementation, you can do:

import fasttext
import fasttext.util

# download an english model
fasttext.util.download_model('en', if_exists='ignore')  # English
model = fasttext.load_model('cc.en.300.bin')

# get word embeddings
# (if instead you want sentence embeddings, use the get_sentence_vector method)
word_1 = 'computer-science'
word_2 = 'machine-learning'
embedding_1 = model.get_word_vector(word_1)
embedding_2 = model.get_word_vector(word_2)

# compare the embeddings (reusing the cosine_similarity helper defined in the question)
cosine_similarity(embedding_1, embedding_2)
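
If you prefer to stay within Gensim, the same idea should work with its fastText loader (a sketch, assuming a Gensim version that provides load_facebook_vectors and that cc.en.300.bin has already been downloaded as above):

from gensim.models.fasttext import load_facebook_vectors

# load the same Facebook .bin model into Gensim's FastTextKeyedVectors;
# OOV words get vectors built from character n-grams, so hyphenated
# phrases such as 'computer-science' receive an embedding too
wv = load_facebook_vectors('cc.en.300.bin')

embedding_1 = wv['computer-science']
embedding_2 = wv['machine-learning']
print(wv.similarity('computer-science', 'machine-learning'))
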
  • Using this method, I am able to obtain embeddings for a lot more phrases. Just curious: does it handle OOV words using the sum of character n-gram vectors, as discussed here: https://stackoverflow.com/questions/50828314/how-does-the-gensim-fasttext-pre-trained-model-get-vectors-for-out-of-vocabulary/50828479#50828479? If so, why did I get an all-zeros array embedding for 'blargfizzle'? – Trent Sep 15 '20 at 02:54
  • Yes, the linked answer is relevant. Using the English cc.en.300.bin model, I get a valid word embedding for 'blargfizzle'. If my answer was useful, please accept and upvote it. – Stefano Fiorucci - anakin87 Sep 15 '20 at 07:54
  • I tried both get_sentence_vector() and get_word_vector() again but still got an all-zeros array embedding for 'blargfizzle'. Before I accept, would you please share your code? Many thanks! – Trent Sep 15 '20 at 08:39
  • I used the same code present in the answer: `model.get_word_vector('blargfizzle')` returns: `array([ 0.00646026, 0.02136607, 0.0110087 , 0.04861083, -0.04463948, ...])` – Stefano Fiorucci - anakin87 Sep 15 '20 at 08:53
  • That's odd, maybe we have different lib versions? My fasttext is 0.9.2 – Trent Sep 15 '20 at 09:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221502/discussion-between-trent-and-anakin87). – Trent Sep 15 '20 at 09:35