
I am trying to use pre-trained word embeddings while taking phrases into account. Popular pre-trained embeddings like GoogleNews-vectors-negative300.bin.gz contain separate embeddings for phrases as well as unigrams, e.g., an embedding for New_York in addition to the two unigrams New and York. Naive word tokenization followed by dictionary look-up ignores the bigram embedding.
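To make the issue concrete, here is a minimal sketch (assuming gensim is installed and the GoogleNews file has been downloaded locally) showing that the phrase key exists in the vocabulary alongside the unigrams:

from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (large file, assumed to be available locally)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

print('New_York' in model)              # expected True: the bigram has its own embedding
print('New' in model, 'York' in model)  # expected True True: so do the unigrams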

Gensim provides a nice Phrases model: given tokenized text, it learns to merge frequent collocations into single tokens, e.g., New_York instead of the two unigrams New and York. It does this by aggregating and comparing count statistics for the unigrams and the bigram.
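For reference, a typical Phrases workflow looks roughly like this (a minimal sketch with made-up training sentences; min_count and threshold are only illustrative, and the exact output depends on the gensim version and scoring):

from gensim.models.phrases import Phrases, Phraser

# Tokenized training sentences (illustrative only)
sentences = [['I', 'love', 'New', 'York'],
             ['New', 'York', 'is', 'huge'],
             ['I', 'love', 'pizza']]

# Learn which bigrams to merge from co-occurrence counts
phrases = Phrases(sentences, min_count=1, threshold=0.1)
bigram = Phraser(phrases)  # frozen, lighter version for applying the merges

print(bigram[['I', 'love', 'New', 'York']])  # e.g., ['I', 'love', 'New_York']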

  1. Is it possible to use Phrases with pre-trained embeddings without estimating the count statistics elsewhere?
  2. If not, is there an efficient way to use these bigram embeddings? I can imagine a way using a loop, but I believe it is ugly (below).

Here is the ugly code.

from nltk import word_tokenize

# `model` is assumed to already hold the pre-trained embeddings
# (e.g., a gensim KeyedVectors instance loaded from GoogleNews-vectors-negative300.bin.gz)
last_added = False
sentence = 'I love New York.'
tokens = ["<s>"] + word_tokenize(sentence) + ["</s>"]
vectors = []  # vocabulary keys (bigrams or unigrams) to look up later
for index, token in enumerate(tokens[1:], start=1):
    if last_added:
        last_added = False
        continue
    bigram = "%s_%s" % (tokens[index - 1], token)
    if bigram in model:
        vectors.append(bigram)
        last_added = True
    else:
        vectors.append(tokens[index - 1])
        last_added = False
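The collected keys then still need to be mapped to the actual vectors, e.g. (assuming model is a gensim KeyedVectors object; the sentinel tokens are skipped since they are not in the vocabulary):

embeddings = [model[key] for key in vectors if key in model]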
geompalik

0 Answers