21

TensorFlow's RNN tutorial (https://www.tensorflow.org/tutorials/recurrent) mentions two parameters: batch size and time steps. I am confused by these concepts. In my opinion, RNNs introduce batches because a to-train sequence can be so long that backpropagation cannot compute over its full length (exploding/vanishing gradients). So we divide the long to-train sequence into shorter sequences, each of which is a mini-batch, and whose size is called the "batch size". Am I right here?

Regarding time steps, an RNN consists of only a cell (an LSTM or GRU cell, or another cell), and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not something real, which means we do not implement it in an unrolled way. Suppose the to-train sequence is a text corpus. Then we feed one word at a time to the RNN cell and update the weights. So why do we have time steps here? Combining this with my understanding of "batch size" above, I am even more confused. Do we feed the cell one word or multiple words (batch size)?

derek

3 Answers

16

Batch size pertains to the number of training samples considered at a time for updating your network weights. So, in a feedforward network, if you want to update your network weights based on gradients computed from one word at a time, your batch_size = 1. As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it also makes training very erratic.
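As an illustration, here is a minimal sketch of that loop in plain NumPy (the model, data, and learning rate are made up; only the role of batch_size matters):

    import numpy as np

    # Toy linear model trained with mini-batch gradient descent.
    X, Y = np.random.randn(100, 8), np.random.randn(100, 1)  # 100 samples
    W = np.zeros((8, 1))
    batch_size, lr = 5, 0.01

    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        grad = xb.T @ (xb @ W - yb) / len(xb)  # gradient from this batch only
        W -= lr * grad                         # one weight update per batch

With batch_size = 1 you would get 100 cheap but noisy updates per pass over the data; with batch_size = 100, one expensive but smooth update.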

To understand what happens during the training of such a feedforward network, I'll refer you to this very nice visual example of single-batch versus mini-batch versus single-sample training.

However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined from the network inputs, and the input-output relation is mapped by the learned network weights:

hidden_activations(t) = f(input(t))

output(t) = g(hidden_activations(t)) = g(f(input(t)))

After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights are updated.

In a recurrent neural network (RNN), however, your network functions a tad differently:

hidden_activations(t) = f(input(t), hidden_activations(t-1))

output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))

= g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(input(t), f(input(t-1), ..., f(input(0), hidden_initial_state))))

As you might have surmised from the naming, the network retains a memory of its previous state, and the neuron activations are now also dependent on the previous network state and, by extension, on every state the network has ever been in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is beside the point of your question.
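To make the difference concrete, here is a minimal NumPy sketch of both forward passes; the dimensions and weight names are made up for illustration:

    import numpy as np

    input_dim, hidden_dim = 4, 3
    W_in = np.random.randn(input_dim, hidden_dim)    # input-to-hidden weights
    W_rec = np.random.randn(hidden_dim, hidden_dim)  # hidden-to-hidden weights

    def feedforward(x_t):
        # hidden_activations(t) = f(input(t)): no memory of the past
        return np.tanh(x_t @ W_in)

    def recurrent(x_t, h_prev):
        # hidden_activations(t) = f(input(t), hidden_activations(t-1))
        return np.tanh(x_t @ W_in + h_prev @ W_rec)

    h = np.zeros(hidden_dim)                         # hidden_initial_state
    for x_t in np.random.randn(10, input_dim):       # a length-10 sequence
        h = recurrent(x_t, h)  # h now depends on every input seen so far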

Then, as you might surmise, it is computationally very, very expensive to calculate the gradients of the loss function with respect to the network parameters if you have to backpropagate through all states since the creation of your network. So there is a neat little trick to speed up your computation: approximate your gradients with only a subset of the historical network states, the last num_steps of them.
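Sketched in NumPy (the numbers are arbitrary): the long sequence is cut into windows of num_steps cells, and gradients are only ever computed inside one window, with the final hidden state of each window carried over, as a constant, to initialize the next:

    import numpy as np

    sequence = np.arange(1000)  # stand-in for one very long training sequence
    num_steps = 20
    usable = len(sequence) // num_steps * num_steps
    windows = sequence[:usable].reshape(-1, num_steps)  # shape (50, 20)
    # A training loop would unroll the cell num_steps times per window and
    # backpropagate within that window only, never past its left edge.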

If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.

Uvar
  • Instead of computing all states, we can compute a subset of all states, which means we just need to have the last "num_steps" memories. For implementation, each memory is an array. So in this case, we need a matrix of "num_steps" x "each memory size". Is my understanding right? – derek Jun 15 '17 at 19:09
  • I am still confused how exactly the training examples in each batch work. Suppose we have 5 training examples in a batch. Does it mean each training example will be fed into a hidden neuron cell so we have a total of 5 cells in the RNN? – derek Jun 15 '17 at 20:18
  • 4
    Suppose you have a data size of 100 and a batch size of 5, for 20 network parameter updates during each epoch. It first propagates the first 5 training examples, updates its parameters based on the optimization method you have provided, then takes the next 5, until it has made a full pass over the data. The num_steps determines the number of cells you unroll and hence the amount of data used in gradient computation. As each cell/layer shares parameters, this does not result in an increase in parameters to optimize, but it enables context learning, which is why you'd want RNNs in the first place. – Uvar Jun 19 '17 at 07:23
  • Comment became too long, so continuing here: suppose you would like to be able to capture in a text corpus context relations like "clouds are in the ....". We all know what the network output could be, regardless of its input. For this you would need num_steps >= 4 for the network to learn these kinds of dependencies. Batch_size has nothing to do with dependencies; it just determines the amount of data used for each training batch. The larger a batch, the more representative the computed gradient is of the whole data set, but also the larger the memory requirements. – Uvar Jun 19 '17 at 07:41
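To make the num_steps >= 4 remark concrete, here is a tiny made-up windowing example; only with num_steps >= 4 does "sky" land in the same training window as "clouds":

    corpus = "clouds are in the sky".split()
    num_steps = 4
    # (input window, next-word target) pairs of length num_steps
    pairs = [(corpus[i:i + num_steps], corpus[i + num_steps])
             for i in range(len(corpus) - num_steps)]
    print(pairs)  # [(['clouds', 'are', 'in', 'the'], 'sky')]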
10

I found this diagram which helped me visualize the data structure.

[Diagram: Data Structure]

From the image, 'batch size' is the number of examples of a sequence you want to train your RNN with for that batch. 'Values per timestep' are your inputs (in my case, my RNN takes 6 inputs), and finally, your time steps are the 'length', so to speak, of the sequence you're training on.
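In code, the diagram corresponds to one input tensor of shape [batch_size, time_steps, values_per_timestep]; the names and numbers below are mine, chosen to match the picture:

    import numpy as np

    batch_size, time_steps, values_per_timestep = 32, 10, 6
    batch = np.zeros((batch_size, time_steps, values_per_timestep))
    # batch[i]    -> one training sequence (one row of the diagram)
    # batch[i, t] -> the 6 input values the RNN receives at step t
    print(batch.shape)  # (32, 10, 6)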

I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).

Batching for feedforward and recurrent nets is slightly different, and when looking at different forums, the terminology for both gets thrown around and it gets really confusing, so visualizing it is extremely helpful.

Hope this helps.

Joey Harwood
Nate Smith
4
  1. The RNN's "batch size" is there to speed up computation (as there are multiple lanes in parallel computation units); it is not a mini-batch for backpropagation. An easy way to see this is to play with different batch size values: an RNN cell with batch size = 4 might run roughly 4 times faster than one with batch size = 1, and their losses are usually very close.

  2. As to the RNN's "time steps", let's look at the following code snippets from rnn.py. static_rnn() calls the cell once for each input_ in turn, and BasicRNNCell.call() implements the forward-pass logic. In a text prediction case, say with batch size = 8, we can think of input_ here as 8 words from different sentences in a big text corpus, not 8 consecutive words in one sentence. In my experience, we decide the value of time steps based on how deeply we would like to model in "time" or "sequential dependency". Again, to predict the next word in a text corpus with BasicRNNCell, a small number of time steps might work. A large time step size, on the other hand, might suffer from the exploding gradient problem. (A usage sketch of these snippets follows after this list.)

    def static_rnn(cell,
                   inputs,
                   initial_state=None,
                   dtype=None,
                   sequence_length=None,
                   scope=None):
        """Creates a recurrent neural network specified by RNNCell `cell`.

        The simplest form of RNN network generated is:

            state = cell.zero_state(...)
            outputs = []
            for input_ in inputs:
                output, state = cell(input_, state)
                outputs.append(output)
            return (outputs, state)
        """

    class BasicRNNCell(_LayerRNNCell):
        def call(self, inputs, state):
            """Most basic RNN:
            output = new_state = act(W * input + U * state + B).
            """
            # One matmul over the concatenated [input, previous state]:
            gate_inputs = math_ops.matmul(
                array_ops.concat([inputs, state], 1), self._kernel)
            gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
            output = self._activation(gate_inputs)
            # The output and the new hidden state are the same tensor.
            return output, output
    
  3. To visualize how these two parameters relate to the data set and the weights, Erik Hallström's post is worth reading. From that diagram and the code snippets above, it's obvious that the RNN's "batch size" will not affect the weights (wa, wb, and b), but its "time steps" does. So, one could decide the RNN's "time steps" based on the problem and network model, and the RNN's "batch size" based on the computation platform and data set.
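For completeness, here is an untested usage sketch of the snippets above against the TensorFlow 1.x API; the dimensions are illustrative:

    import tensorflow as tf  # TensorFlow 1.x

    batch_size, num_steps, input_dim, num_units = 8, 5, 10, 16
    x = tf.placeholder(tf.float32, [batch_size, num_steps, input_dim])
    inputs = tf.unstack(x, num_steps, axis=1)  # num_steps tensors of
                                               # shape [batch_size, input_dim]
    cell = tf.nn.rnn_cell.BasicRNNCell(num_units)
    outputs, state = tf.nn.static_rnn(cell, inputs, dtype=tf.float32)
    # The cell's kernel has shape [input_dim + num_units, num_units]:
    # batch_size appears nowhere in the trainable parameters.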

rcnjko