I am trying to use pre-trained word embeddings while taking phrases into account. Popular pre-trained embeddings such as GoogleNews-vectors-negative300.bin.gz have separate embeddings for phrases as well as unigrams, e.g., an embedding for New_York in addition to the two unigrams New and York. Naive word tokenization and dictionary look-up ignore the bigram embedding.
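For example, the phrase entry can be checked directly after loading the vectors with gensim's KeyedVectors (a minimal sketch; the path is just the downloaded archive):

from gensim.models import KeyedVectors

# the GoogleNews archive mentioned above, downloaded locally
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True
)

print('New_York' in model)               # the bigram has its own entry
print('New' in model, 'York' in model)   # and so do the unigrams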
Gensim provides a nice Phrases model: given a text sequence, it can learn compact phrases, e.g., New_York instead of the two unigrams New and York. It does this by aggregating and comparing count statistics between the unigrams and the bigram (a sketch of that usage follows the questions below).

1. Is it possible to use Phrases with pre-trained embeddings without estimating the count statistics elsewhere?
2. If not, is there an efficient way to use these bigrams? I can imagine a way using a loop, but I believe it is ugly (below).
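For reference, this is roughly how I understand Phrases is normally trained from a corpus' own counts (the toy corpus and thresholds are purely illustrative); it is exactly this counting step I would like to skip:

from gensim.models.phrases import Phrases

# toy corpus just for illustration
corpus = [
    ["i", "love", "new", "york"],
    ["new", "york", "is", "huge"],
    ["i", "love", "pizza"],
]

# Phrases scores each adjacent pair from co-occurrence counts in the corpus
bigram = Phrases(corpus, min_count=1, threshold=1)

print(bigram[["i", "love", "new", "york"]])
# pairs whose score clears `threshold` come back joined, e.g. 'new_york'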
Here is the ugly code.
from nltk import word_tokenize

# model is the pre-trained KeyedVectors loaded above
sentence = 'I love New York.'
tokens = ["<s>"] + word_tokenize(sentence) + ["</s>"]

keys = []  # vocabulary keys, with adjacent tokens merged into a bigram where possible
skip_next = False
for index, token in enumerate(tokens):
    if skip_next:
        # this token was already consumed as the second half of a bigram
        skip_next = False
        continue
    if index + 1 < len(tokens) and "%s_%s" % (token, tokens[index + 1]) in model:
        keys.append("%s_%s" % (token, tokens[index + 1]))
        skip_next = True
    else:
        keys.append(token)

vectors = [model[key] for key in keys if key in model]
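With the GoogleNews vectors this should yield keys like ['<s>', 'I', 'love', 'New_York', '.', '</s>'] (assuming no other adjacent pair happens to be in the vocabulary), but looping over every token pair like this feels clumsy, hence the question about doing it through Phrases.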