
I am trying to tag letters in long character sequences. The inherent structure of the data requires a bidirectional approach.

Furthermore, for this task I need access to the hidden state at each timestep, not just the final one.

To test the idea I used a fixed-length approach: I currently build batches of random pieces of, say, 60 characters each, cut from my much longer sequences, and run my hand-built bidirectional classifier with zero_state as the initial_state for every 60-character piece.

This worked fine, but obviously not perfectly, since in reality the sequences are longer and the information to the left and right of the piece I randomly cut from the original source is lost.

Now, to improve on this, I want to work with the entire sequences. They vary heavily in length, though, and there is no way the entire sequences (batched, on top of that) will fit onto the GPU.

I found the swap_memory parameter in the dynamic_rnn documentation. Would that help?

I didn't find any further documentation that helped me understand it. And I cannot easily try it out myself: because I need access to the hidden states at each timestep, I coded the current graph without using any of the higher-level wrappers (such as dynamic_rnn). Trying it out would require me to get all the intermediate states out of the wrapper, which, as I understand it, is a lot of work to implement.

Before going through the hassle of trying this out, I would love to be sure that it would indeed solve my memory issue. Thanks for any hints!

Phillip Bock
  • What kind of memory issues do you have? I also have experience with processing character sequences with Bi-LSTMs. In my case, I ended up choosing fixed-length sequences with padding. Can you post a histogram of the sequence lengths? – onur güngör May 26 '17 at 06:48
  • I don't have a histogram. The inputs are character sequences such as emails and web pages, so the length ranges from a few characters to tens of thousands. – Phillip Bock May 27 '17 at 12:04
  • A histogram, the number of samples, etc. would help us advise on whether it's worth the programming effort. Anyway, this recent study (both the implementation and the paper) might be very interesting for you if you want to dive deeper into general solutions for batching issues. It is not for TensorFlow, but their graphs are very appealing. – onur güngör May 27 '17 at 15:52
  • Sorry I forgot to paste the link: https://arxiv.org/abs/1705.07860 – onur güngör May 27 '17 at 16:26

1 Answer


TL;DR: swap_memory won't let you work with pseudo-infinite sequences, but it will help you fit bigger (longer, or wider, or larger-batch) sequences in memory. There is a separate trick for pseudo-infinite sequences, but it only applies to unidirectional RNNs.


swap_memory

During training, an NN (including an RNN) generally needs to save some activations in memory -- they are needed to calculate the gradient.

What swap_memory does is tell your RNN to store these activations in host (CPU) memory instead of device (GPU) memory, and stream them back to the GPU when they are needed.

Effectively, this lets you pretend that your GPU has more memory than it actually does (at the expense of CPU memory, which tends to be more plentiful).

You still have to pay the computational cost of using very long sequences. Not to mention that you might run out of host memory.

To use it, simply give that argument the value True.
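For dynamic_rnn that is just one extra keyword argument. A minimal TF 1.x sketch (the cell size and feature dimension below are made up for illustration):

    import tensorflow as tf

    # Made-up cell size and feature dimension, purely for illustration.
    cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
    inputs = tf.placeholder(tf.float32, [None, None, 64])  # [batch, max_time, features]

    outputs, final_state = tf.nn.dynamic_rnn(
        cell,
        inputs,
        dtype=tf.float32,
        swap_memory=True)  # keep forward-pass activations in host (CPU) memory

(In older 1.x releases the cell classes live under tf.contrib.rnn instead of tf.nn.rnn_cell.)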


sequence_length

Use this parameter if your sequences are of different lengths. Its name is misleading: sequence_length is actually an array of sequence lengths, one per example in the batch.

You still need as much memory as you would if all your sequences were of the same (maximum) length, i.e. max_time.
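Continuing the sketch above, the lengths are passed as one integer per batch element (the values here are illustrative); past each sequence's true length the outputs are zeroed and the state is simply copied through:

    # One length per example in the batch, e.g. [60, 143, 7, ...].
    seq_lens = tf.placeholder(tf.int32, [None])

    outputs, final_state = tf.nn.dynamic_rnn(
        cell,
        inputs,
        sequence_length=seq_lens,
        dtype=tf.float32,
        swap_memory=True)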


tf.nn.bidirectional_dynamic_rnn

TF includes a ready-made implementation of bidirectional RNNs, so it might be easier to use it than to maintain your own.
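It also returns the per-timestep outputs of both directions, which is exactly the "hidden state at each timestep" you need for tagging. A rough sketch, continuing from above (names are illustrative):

    fw_cell = tf.nn.rnn_cell.LSTMCell(128)
    bw_cell = tf.nn.rnn_cell.LSTMCell(128)

    (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
        fw_cell, bw_cell, inputs,
        sequence_length=seq_lens,
        dtype=tf.float32,
        swap_memory=True)

    # out_fw and out_bw are [batch, max_time, 128]; concatenating them gives
    # a per-character feature vector to feed into a tagging/output layer.
    step_features = tf.concat([out_fw, out_bw], axis=-1)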


Stateful RNNs

To deal with very long sequences when training unidirectional RNNs, people do something else: they save the final hidden states of every batch and use them as the initial hidden states for the next batch. (For this to work, the next batch has to consist of the continuations of the previous batch's sequences.)

These threads discuss how this can be done in TF:

TensorFlow: Remember LSTM state for next batch (stateful LSTM)

How do I set TensorFlow RNN state when state_is_tuple=True?
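In case it helps, here is a rough sketch of the idea for a unidirectional LSTM (placeholder names are made up, and it assumes consecutive batches contain consecutive chunks of the same underlying sequences):

    # Feed the final state of one chunk back in as the initial state of the next.
    init_c = tf.placeholder(tf.float32, [None, 128])
    init_h = tf.placeholder(tf.float32, [None, 128])
    initial_state = tf.nn.rnn_cell.LSTMStateTuple(init_c, init_h)

    outputs, final_state = tf.nn.dynamic_rnn(
        cell,
        inputs,
        initial_state=initial_state,
        swap_memory=True)

    # In the training loop (schematically):
    #   state_c = state_h = np.zeros((batch_size, 128))
    #   for chunk in consecutive_chunks:
    #       state_c, state_h, _ = sess.run(
    #           [final_state.c, final_state.h, train_op],
    #           feed_dict={inputs: chunk, init_c: state_c, init_h: state_h})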

MWB
  • I did what you described as well, but as I tag sentence elements I need to go bidirectional, and I achieved much better results with the fixed-length implementation. In the bidirectional case I cannot store the final states and reuse them in the next batch, as I would need to decide whether to save and reuse the state from the forward or the backward pass. Hence I am looking into the bidirectional dynamic case and hope to find a "one-stop" solution there. – Phillip Bock May 24 '17 at 10:55
  • @friesel Indeed, this trick is for unidirectional RNNs (I added a clarification). – MWB May 24 '17 at 17:14
  • I take your answer as: "yes, indeed you can have dynamically long sequences in the bidirectional approach, limited only by the memory the GPU and CPU have access to at the moment they need it, not a priori". Thanks – Phillip Bock May 28 '17 at 20:01
  • @friesel I just updated the answer. I hope it's perfectly clear. – MWB May 28 '17 at 20:36