There is a cleaner way to do this with index_table_from_file
and the Dataset API.
First, create your own tf.data.Dataset
(I assume we have two sentences with some arbitrary labels):
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
Second, create a vocab.txt
file in which each line number maps to the same index in the Glove
embedding. For example, if the first word in the Glove vocabulary is "absent",
the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt
contains the following words:
first
is
test
this
second
sentence
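Such a vocab.txt can be generated directly from the Glove file, since the standard Glove text format puts one word per line followed by its vector, in the same order as the embedding rows. A minimal sketch (the function name and paths are my own, not part of any API):

```python
# Sketch: extract the vocabulary from a GloVe-style text file, preserving
# line order, so that line i of vocab.txt corresponds to row i of the
# embedding matrix. Assumes the usual "word v1 v2 ... vN" format per line.
def write_vocab(glove_path, vocab_path):
    with open(glove_path, encoding='utf-8') as fin, \
         open(vocab_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            word = line.split(' ', 1)[0]  # token before the first space
            fout.write(word + '\n')
```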
Then, define a lookup table that maps each word to a specific id:
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
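The `embedding` matrix used in the next step has to be loaded from the Glove file into a numpy array first. A sketch, assuming the same one-word-per-line text format (the function name and path are assumptions); note that with num_oov_buckets=1 the lookup table can emit an id equal to the vocabulary size, so an extra row is appended for out-of-vocabulary words:

```python
import numpy as np

# Sketch: load GloVe vectors into a (vocab_size + 1, dim) float32 matrix
# whose row order matches vocab.txt. The final row of zeros serves the
# out-of-vocabulary bucket created by num_oov_buckets=1.
def load_glove(glove_path):
    vectors = []
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors.append([float(v) for v in parts[1:]])
    vectors.append([0.0] * len(vectors[0]))  # OOV row
    return np.array(vectors, dtype=np.float32)

# embedding = load_glove('glove.6B.300d.txt')  # path is an assumption
```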
Finally, use tf.nn.embedding_lookup()
to convert each sentence to its embedding:
glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()
embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
Complete code in eager mode:
import tensorflow as tf
tf.enable_eager_execution()
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())
for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)