I have a function that extracts the pre-trained embeddings from GloVe.txt
and loads them as Keras Embedding layer
weights, but how can I do the same for the given two files?
This accepted Stack Overflow answer gave me the feeling that a .vec
file can be treated as a .txt
file, so we might extract the embeddings from fasttext.vec
with the same technique we use for glove.txt
. Is my understanding correct?
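To check that idea I wrote a tiny sketch (the file content below is made up for illustration). My assumption, based on that answer, is that a fasttext .vec file has the same word-then-numbers layout as GloVe, except its first line is a "vocab_size dim" header that has to be skipped:

```python
from io import StringIO

# Fake .vec content (my assumption about the format): a "vocab_size dim"
# header line, then one word per line followed by its vector components.
fake_vec = StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

embeddings_index = {}
fake_vec.readline()  # skip the header line -- a GloVe .txt has no such line
for line in fake_vec:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

print(sorted(embeddings_index))  # ['hello', 'world']
```

So the only difference from my GloVe loader would be that one skipped header line, if my assumption about the format is right.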
I went through a lot of blogs and Stack Overflow answers to find out what to do with the binary file, and I found in this answer that the binary (.bin)
file is the MODEL itself, not the embeddings, and that you can convert the .bin file to a text file using Gensim
. I think that saves the embeddings, and then we can load the pre-trained embeddings just like we load GloVe
. Is my understanding correct?
Here is the code to do that. I want to know if I'm on the right path, because I could not find a satisfactory answer to my question anywhere.
from numpy import asarray, zeros
from keras.preprocessing.sequence import pad_sequences
from gensim.models import KeyedVectors

tokenizer.fit_on_texts(data)  # tokenizer is a Keras Tokenizer()
vocab_size = len(tokenizer.word_index) + 1  # extra 1 for unknown words
encoded_docs = tokenizer.texts_to_sequences(data)  # data is a list of sentences (strings)
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  # max_length is say 30
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  # this will load the binary Word2Vec model
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)  # this will save the VECTORS in a text file. Can I load it using the below function?
def load_embeddings(vocab_size, fitted_tokenizer, emb_file_path, emb_dim=300):
    '''
    It can load GloVe.txt for sure. But is it the right way to load
    paragram.txt, fasttext.vec and word2vec.bin if converted to .txt?
    '''
    embeddings_index = dict()
    with open(emb_file_path, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = zeros((vocab_size, emb_dim))
    for word, i in fitted_tokenizer.word_index.items():  # use the parameter, not the global tokenizer
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
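To convince myself the matrix-filling part of the function behaves as I expect, I ran just that step on toy data (the word_index and vectors below are made up): rows for in-vocabulary words get copied from the index, and anything out of vocabulary stays all zeros.

```python
import numpy as np

# Toy stand-ins (made up for illustration): a fitted tokenizer's word_index
# and an embeddings_index as load_embeddings() would build it.
word_index = {'hello': 1, 'world': 2, 'xyzzy': 3}  # Keras indices start at 1
embeddings_index = {'hello': np.array([0.1, 0.2], dtype='float32'),
                    'world': np.array([0.4, 0.5], dtype='float32')}

vocab_size, emb_dim = len(word_index) + 1, 2
embedding_matrix = np.zeros((vocab_size, emb_dim))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

print(embedding_matrix[3])  # row for the OOV word 'xyzzy' stays all zeros
```

The returned matrix is what I then pass as weights=[embedding_matrix] to the Keras Embedding layer.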
My question is: can we load the .vec
file directly, AND can we load the .bin
file (converted as described above) with the given load_embeddings()
function?