There is a cleaner way to do this with index_table_from_file
and the Dataset API.
First, create your own tf.data.Dataset
(I assume we have two sentences with some arbitrary labels):
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
Second, create a vocab.txt
file in which each line number maps to the same index in the Glove
embedding. For example, if the first word in the Glove vocabulary is "absent",
the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt
contains the following words:
first
is
test
this
second
sentence
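Such a vocab.txt can be generated directly from the Glove file, since the standard Glove text format puts one word per line followed by its vector, in the same order as the embedding rows. A minimal sketch (the function name and paths are my own, not part of any API):

```python
# Sketch: extract the vocabulary from a GloVe-style text file, preserving
# line order, so that line i of vocab.txt corresponds to row i of the
# embedding matrix. Assumes the usual "word v1 v2 ... vN" format per line.
def write_vocab(glove_path, vocab_path):
    with open(glove_path, encoding='utf-8') as fin, \
         open(vocab_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            word = line.split(' ', 1)[0]  # token before the first space
            fout.write(word + '\n')
```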
Then, define a lookup table that maps each word to a specific id:
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
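The `embedding` matrix used in the next step has to be loaded from the Glove file into a numpy array first. A sketch, assuming the same one-word-per-line text format (the function name and path are assumptions); note that with num_oov_buckets=1 the lookup table can emit an id equal to the vocabulary size, so an extra row is appended for out-of-vocabulary words:

```python
import numpy as np

# Sketch: load GloVe vectors into a (vocab_size + 1, dim) float32 matrix
# whose row order matches vocab.txt. The final row of zeros serves the
# out-of-vocabulary bucket created by num_oov_buckets=1.
def load_glove(glove_path):
    vectors = []
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors.append([float(v) for v in parts[1:]])
    vectors.append([0.0] * len(vectors[0]))  # OOV row
    return np.array(vectors, dtype=np.float32)

# embedding = load_glove('glove.6B.300d.txt')  # path is an assumption
```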
Finally, use tf.nn.embedding_lookup()
to convert each sentence to its embedding:
glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()
embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
Complete code in eager mode:
import tensorflow as tf
tf.enable_eager_execution()
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())
for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)