
I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec embeddings into this model, rather than using the original 'dense representation' mentioned in the tutorial.

From my point of view, it looks like TensorFlow uses a one-hot representation for the seq2seq model. Firstly, the function tf.nn.seq2seq.embedding_attention_seq2seq takes tokenized symbols as the encoder's input, e.g. 'a' would be '4' and 'dog' would be '15715', and it requires num_encoder_symbols. So I think it wants me to provide the position of the word and the total number of words, and then the function can represent the word in a one-hot representation. I am still reading the source code, but it is hard to understand.
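To make my understanding concrete, here is a rough sketch of what I think happens to an encoder input (my own illustration, not the tutorial code; vocab_size and emb_dim are placeholder names):

    import tensorflow as tf

    vocab_size = 40000   # what I pass as num_encoder_symbols
    emb_dim = 128        # size of the dense vectors

    # Each encoder input step is a batch of integer token ids,
    # e.g. 'a' -> 4 and 'dog' -> 15715.
    token_ids = tf.placeholder(tf.int32, shape=[None], name="encoder_input")

    # Internally each id is turned into a dense vector by an embedding lookup,
    # which is equivalent to multiplying a one-hot vector of length vocab_size
    # with an embedding matrix.
    embedding = tf.get_variable("embedding", [vocab_size, emb_dim])
    embedded = tf.nn.embedding_lookup(embedding, token_ids)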

Could anyone give me an idea about the above problem?

Hanyu Guo

2 Answers


The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:

EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"

Knowing this, you can just modify this variable. I mean: get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the read vectors in the way illustrated by the snippet below (it's just a snippet; you'll have to change it to make it work, but I hope it shows the idea).

    import sys

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.platform import gfile

    # FLAGS, model, session and vec_size come from the surrounding training script.
    # Find the encoder embedding matrix among the trainable variables.
    vectors_variable = [v for v in tf.trainable_variables()
                        if EMBEDDING_KEY in v.name]
    if len(vectors_variable) != 1:
      print("Word vector variable not found or too many.")
      sys.exit(1)
    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()
    print("Setting word vectors from %s" % FLAGS.word_vector_file)
    with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
      # Lines have format: dog 0.045123 -0.61323 0.413667 ...
      for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
          # Remaining parts are components of the vector.
          word_vector = np.array([float(x) for x in line_parts[1:]])
          if len(word_vector) != vec_size:
            print("Warn: Word '%s', expecting vector size %d, found %d"
                  % (word, vec_size, len(word_vector)))
          else:
            vectors[model.vocab[word]] = word_vector
    # Assign the modified vectors to vectors_variable in the graph by feeding
    # them through the variable's initializer op.
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
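The last two lines reuse the tensor that feeds the variable's initial value, so the pre-trained matrix is not added to the graph as a constant. If you don't mind a large constant in the graph, a simpler alternative (my suggestion, not part of the snippet above) would be a plain assign:

    session.run(vectors_variable.assign(vectors))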
Lukasz Kaiser
  • Really helpful advice. But I was still wondering where the original embedding file is stored? Why use a scope method like `with vs.variable_scope(scope or type(self).__name__):`? I would think the most straightforward way would be to load a certain embedding vector file from somewhere. Thanks again for your help :) – Hanyu Guo Mar 22 '16 at 20:14
  • What about "PAD", "EOS" and "GO" tags? Where do those go? – Christos Hadjinikolis Feb 15 '17 at 14:18

I guess with the scope style, which Matthew mentioned, you can get the variable:

 with tf.variable_scope("embedding_attention_seq2seq"):
        with tf.variable_scope("RNN"):
            with tf.variable_scope("EmbeddingWrapper", reuse=True):
                  embedding = vs.get_variable("embedding", [shape], [trainable=])

Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:

"embedding_attention_seq2seq/embedding_attention_decoder/embedding"


Thanks for your answer, Lukasz!

I was wondering, what exactly does `model.vocab[word]` in the code snippet stand for? Just the position of the word in the vocabulary?

In that case, wouldn't it be faster to iterate through the vocabulary and inject w2v vectors only for the words that exist in the w2v model?
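Something along these lines is what I mean (just a sketch; it assumes w2v behaves like a gensim word2vec model that supports `in` and indexing, and that model.vocab maps each word to its row index):

    import numpy as np

    vectors = vectors_variable.eval()   # current embedding matrix
    for word, idx in model.vocab.items():
        if word in w2v:                 # skip words missing from word2vec
            vectors[idx] = np.asarray(w2v[word], dtype=vectors.dtype)
    session.run(vectors_variable.assign(vectors))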

Vlad Kolesnyk