
I am currently trying to compare the similarity of millions of documents. For a first test on a CPU, I reduced them to around 50 characters each and am trying to get the ELMo embeddings for 10 of them at a time, like this:

import tensorflow_hub as hub

ELMO = "https://tfhub.dev/google/elmo/2"
texts = []
i = 0
for row in file:
    split = row.split(";", 1)
    if len(split) > 1:
        text = split[1].replace("\n", "")
        texts.append(text[:50])
    if i == 300:
        break
    if i % 10 == 0:
        elmo = hub.Module(ELMO, trainable=False)
        executable = elmo(
            texts,
            signature="default",
            as_dict=True)["elmo"]
        vectors = execute(executable)  # execute() runs the graph in a session
        texts = []
    i += 1

However, even with this small example, after around 300 sentences (and without even saving the vectors) the program consumes up to 12GB of RAM. Is this a known issue (the other issues I found suggest something similar, but not quite that extreme) or did I make a mistake?

Daniel Töws
  • You are passing in a variable `sentences` but we cannot see where this is defined – Stewart_R Jun 07 '19 at 06:55
  • Sorry, my bad. It should have been `texts` instead of `sentences` (in my code the ELMo part is in its own method, where the parameter is called `sentences`). I have edited it. – Daniel Töws Jun 07 '19 at 07:26

1 Answer


This is for TensorFlow 1.x without Eager mode, I suppose (or else the use of hub.Module would likely hit bigger problems).

In that programming model, you need to first express your computation in a TensorFlow graph, and then execute that graph repeatedly for each batch of data.

  • Constructing the module with hub.Module() and applying it to map an input tensor to an output tensor are both parts of graph building and should happen only once.

  • The loop over the input data should merely call session.run() to feed input and fetch output data from the fixed graph.
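As a plain-Python analogy for this build-once, run-many split (using a compiled regex in place of a TensorFlow graph; no TensorFlow involved, names are illustrative):

```python
import re

# "Graph building": done exactly once, like constructing hub.Module()
# and applying it to a placeholder tensor.
WORD = re.compile(r"\w+")

def run_batch(texts):
    # "session.run()": a cheap, repeated call against the fixed object.
    return [WORD.findall(t) for t in texts]

for batch in [["hello world"], ["quick brown fox"]]:
    print(run_batch(batch))
```

Recompiling `WORD` inside the loop would be wasteful; rebuilding the ELMo graph inside the loop is the same mistake at a much larger scale, since each `hub.Module()` call adds a fresh copy of the model to the graph.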

Fortunately, there is already a utility function to do all this for you:

import numpy as np
import tensorflow_hub as hub

# For demo use only. Extend to your actual I/O needs as you see fit.
inputs = (x for x in ["hello world", "quick brown fox"])

with hub.eval_function_for_module("https://tfhub.dev/google/elmo/2") as f:
  for pystr in inputs:
    batch_in = np.array([pystr])
    batch_out = f(batch_in)
    print(pystr, "--->", batch_out[0])

What this does for you in terms of raw TensorFlow is roughly this:

module = Module(ELMO_OR_WHATEVER)
tensor_in = tf.placeholder(tf.string, shape=[None])  # As befits `module`.
tensor_out = module(tensor_in)

# This kind of session handles init ops for you.
with tf.train.SingularMonitoredSession() as sess:
  for pystr in inputs:
    batch_in = np.array([pystr])
    batch_out = sess.run(tensor_out, feed_dict={tensor_in: batch_in})
    print(pystr, "--->", batch_out[0])

If your needs are too complex for `with hub.eval_function_for_module ...`, you could build on this more explicit example.

Notice how the hub.Module is neither constructed nor called in the loop.
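To feed batches of 10 texts (as in the question) instead of single strings, the text preparation can also stay in plain Python, cleanly separated from any TensorFlow calls. A hypothetical helper, assuming semicolon-separated rows as in the question (the name and parameters are illustrative):

```python
def batched_texts(rows, batch_size=10, max_rows=300, max_chars=50):
    """Yield lists of up to batch_size cleaned, truncated text snippets."""
    batch = []
    for i, row in enumerate(rows):
        if i == max_rows:
            break
        split = row.split(";", 1)
        if len(split) > 1:
            batch.append(split[1].replace("\n", "")[:max_chars])
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush any final partial batch
        yield batch

for batch in batched_texts(["id;hello world\n", "id;quick brown fox\n"],
                           batch_size=2):
    print(batch)
```

Each yielded batch can then be fed to the graph that was built once outside the loop, e.g. `f(np.array(batch))` with the eval function, or via `feed_dict` in the explicit version.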

PS: Tired of worrying about building graphs vs running sessions? Then TF2 and eager execution are for you. Check out https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb

arnoegw
  • It worked. I still get an increase (from 1.5GB to 2GB), which I can not explain, but it seems much more manageable. How does it work though? I thought Python had its own form of "garbage collection" where an object that is no longer referenced gets deleted. Shouldn't this have happened? – Daniel Töws Jun 07 '19 at 11:27
  • My first answer was incomplete. Should be much more satisfactory now. – arnoegw Jun 07 '19 at 13:01
  • The example code with `eval_function_for_module` didn't work for me: I got `InternalError: Dst tensor is not initialized. [[{{node checkpoint_initializer_14}}]]` – Yu Shen Sep 23 '19 at 02:36