
I am interested in initialising the TensorFlow seq2seq implementation with pretrained word2vec embeddings.

I have looked at the code. It seems the embedding is initialized as follows:

with tf.variable_scope(scope or "embedding_attention_decoder"):
  with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [num_symbols, cell.input_size])

How do I change this so it is initialised with pretrained word2vec vectors?

skw

2 Answers


I think you've gotten your answer on the mailing list, but I am putting it here for posterity.

https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/bH6S98NpIJE

You can initialize it randomly and afterwards do: session.run(embedding.assign(my_word2vec_matrix))

This will override the initial (random) values.

This seems to work for me. I believe trainable=False is needed if you want to keep the embedding values fixed during training.

import tensorflow as tf
from gensim.models import Word2Vec

# load pretrained word2vec model (here via gensim)
model = Word2Vec.load_word2vec_format(FILENAME, binary=True)

# embedding matrix
X = model.syn0
print(type(X))  # numpy.ndarray
print(X.shape)  # (vocab_size, embedding_dim)

# start an interactive session
sess = tf.InteractiveSession()

# create the embedding variable; trainable=False keeps the values fixed
embeddings = tf.Variable(tf.random_uniform(X.shape, minval=-0.1, maxval=0.1),
                         trainable=False)

# initialize all variables (random values for now)
sess.run(tf.initialize_all_variables())

# override the random initialization with the word2vec matrix
sess.run(embeddings.assign(X))
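
If you want to push the word2vec matrix into the embedding variable that the seq2seq code itself creates (the tf.get_variable("embedding", ...) in the question), one option is to build the model first and then look the variable up by name and assign to it. This is only a rough sketch; the scope prefix "embedding_attention_decoder" below is an assumption and depends on how you construct the model, so check the names reported by tf.all_variables().

# assumes the seq2seq graph has already been built, so the variable exists;
# the scope prefix is a guess: verify it against tf.all_variables()
embedding_var = [v for v in tf.all_variables()
                 if v.name.endswith("embedding_attention_decoder/embedding:0")][0]

sess.run(tf.initialize_all_variables())

# overwrite the random initialization; X must have shape [num_symbols, embedding_dim]
sess.run(embedding_var.assign(X))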
tokestermw

You can change the tokenizer in tensorflow/models/rnn/translate/data_utils.py to use a pre-trained word2vec model for tokenizing. Lines 187-190 of data_utils.py:

if tokenizer:
    words = tokenizer(sentence)
else:
    words = basic_tokenizer(sentence)

use basic_tokenizer. You can write a tokenizer method that uses a pre-trained word2vec model for tokenizing the sentences.
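
For instance, one simple approach is a tokenizer that splits on whitespace and keeps only words present in the word2vec vocabulary, mapping everything else to an unknown token. A minimal sketch, assuming model is the gensim model loaded as in the answer above and that "_UNK" exists in your vocabulary:

# hypothetical tokenizer restricted to the word2vec vocabulary;
# data_utils expects a callable that takes a sentence and returns a list of tokens
def word2vec_tokenizer(sentence, model=model, unk_token="_UNK"):
    tokens = []
    for word in sentence.strip().split():
        # keep the word if the pretrained model knows it, otherwise map it to UNK
        tokens.append(word if word in model.vocab else unk_token)
    return tokens

You would then pass word2vec_tokenizer as the tokenizer argument, so the if tokenizer: branch above uses it instead of basic_tokenizer.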

Anurag Ranjan