padding a batch with 0 vectors in dynamic rnn

Question

I have a prediction task that works with variable-length sequences of input data. Using a dynamic RNN directly runs into trouble when splitting the outputs, as described in this post:

Using a variable for num_splits for tf.split()

So I am wondering whether it is possible to pad an entire batch with extra sequences so that all examples have the same number of sequences, and then pass a length of 0 for those padded sequences in the sequence_length parameter of tf.nn.dynamic_rnn. Would this work?

score 1 · Answer 1 · answered Aug 30 '22 at 02:38

These days (2022), there are two ways to pad sequences in TensorFlow: batching with a tf.data.Dataset pipeline, or preprocessing with tf.keras.utils.pad_sequences.

Method 1: Use Tensorflow Pipelines (tf.data.Dataset)

The padded_batch() method can be used in place of a normal batch() method to pad the elements of a tf.data.Dataset object when batching for model training: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch

The 'batching tensors with padding' pipeline is also described here: https://www.tensorflow.org/guide/data#batching_tensors_with_padding

The call signature is:

padded_batch(
    batch_size,
    padded_shapes=None,
    padding_values=None,
    drop_remainder=False,
    name=None
)

An example for your use case of inputting to an RNN is:

import tensorflow as tf

# input is a ragged tensor of different sequence lengths
inputs = tf.ragged.constant([[1], [2, 3], [4, 5, 6]], dtype=tf.float32)
# construct dataset using tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices(inputs)
# convert ragged tensor to dense tensor to avoid TypeError
dataset = dataset.map(lambda x: x)
# pad sequences using padded_batch
dataset = dataset.padded_batch(3)

# take the single padded batch, shape (3, 3), and add a feature axis,
# because SimpleRNN expects input of shape (batch, timesteps, features)
batch = next(iter(dataset))
batch = tf.expand_dims(batch, axis=-1)

# run the batch through a simple RNN model
simple_rnn = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(4)
])
output = simple_rnn(batch)

Note that this method does not support pre-padding; it always pads at the end of each sequence (post-padding). You can, however, use the padded_shapes argument to set the padded sequence length.
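To illustrate the padded_shapes and padding_values arguments, here is a minimal sketch. The target length of 5 and the padding value of -1.0 are arbitrary choices for the example, and the dataset is built with from_generator purely for brevity.

```python
import tensorflow as tf

# Three variable-length sequences, yielded one at a time.
sequences = [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.float32),
)

# Pad every sequence at the end to length 5 (arbitrary) with -1.0 (arbitrary).
dataset = dataset.padded_batch(3, padded_shapes=[5], padding_values=-1.0)

batch = next(iter(dataset))
print(batch.shape)  # (batch_size, padded_length)
```

Leaving padded_shapes at its default pads each batch only to the longest sequence in that batch, so different batches can end up with different lengths.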

Method 2: Preprocess sequence as nested list using Keras pad_sequences

Keras (a package sitting on top of Tensorflow since version 2.0) provides a utility function to truncate and pad Python lists to a common length: https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

The call signature is:

tf.keras.utils.pad_sequences(
    sequences,
    maxlen=None,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)

From the documentation:

This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D Numpy array of shape (num_samples,num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence in the list.

Sequences that are shorter than num_timesteps are padded with value until they are num_timesteps long.

Sequences longer than num_timesteps are truncated so that they fit the desired length.

The position where padding or truncation happens is determined by the arguments padding and truncating, respectively. Pre-padding or removing values from the beginning of the sequence is the default.
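As a quick illustration of the padding, truncating and maxlen arguments described above, here is a small sketch on a toy input. The values are made up for the example.

```python
import tensorflow as tf

sequences = [[1], [2, 3], [4, 5, 6]]

# Default: zeros are added at the start of each short sequence.
pre_padded = tf.keras.utils.pad_sequences(sequences)

# padding="post" adds the zeros at the end instead.
post_padded = tf.keras.utils.pad_sequences(sequences, padding="post")

# maxlen=2 with the default truncating="pre" drops values from the start
# of any sequence longer than 2.
truncated = tf.keras.utils.pad_sequences(sequences, maxlen=2)

print(pre_padded)
print(post_padded)
print(truncated)
```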

An example for your use case of inputting to an RNN:

import numpy as np
import tensorflow as tf

# inputs is a list of varying-length sequences with batch size (list length) 3
inputs = [[1], [2, 3], [4, 5, 6]]
# pad the sequences with 0's using pre-padding (default values)
inputs = tf.keras.utils.pad_sequences(inputs, dtype=np.float32)
# add a trailing feature dimension: (batch, timesteps) -> (batch, timesteps, 1)
inputs = tf.expand_dims(inputs, axis=-1)

# run the batch through a simple RNN layer
simple_rnn = tf.keras.layers.SimpleRNN(4)
output = simple_rnn(inputs)
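Padding by itself does not tell the RNN which timesteps are real. One possible way to get something like the sequence_length behaviour from the question is to pass a boolean mask to the Keras layer. The sketch below assumes post-padded data and a list of true lengths (lengths is a hypothetical name, assumed to be known from preprocessing), and builds the mask with tf.sequence_mask.

```python
import numpy as np
import tensorflow as tf

# Post-padded batch of shape (batch, timesteps, features) = (3, 3, 1).
padded = tf.constant(
    [[[1.0], [0.0], [0.0]],
     [[2.0], [3.0], [0.0]],
     [[4.0], [5.0], [6.0]]]
)
# True sequence lengths (assumed to be known from preprocessing).
lengths = [1, 2, 3]
# Boolean mask of shape (3, 3): True for real timesteps, False for padding.
mask = tf.sequence_mask(lengths, maxlen=3)

simple_rnn = tf.keras.layers.SimpleRNN(4)
masked_output = simple_rnn(padded, mask=mask)

# Run the first sequence on its own, without padding, and compare it
# with the masked result for the same sequence.
unpadded_output = simple_rnn(tf.constant([[[1.0]]]))
same = np.allclose(masked_output[0].numpy(), unpadded_output[0].numpy(), atol=1e-5)
print(masked_output.shape, same)
```

The idea is that masked timesteps are skipped and the state from the last real timestep is carried forward, so the padding should not influence the final output.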

score 0 · Answer 2 · answered Jun 21 '22 at 11:17

I'm a little late to the party, but are you looking for torch.nn.utils.rnn.pad_sequence?

Example (from documentation):

>>> import torch
>>> from torch.nn.utils.rnn import pad_sequence
>>> a = torch.ones(25, 300)
>>> b = torch.ones(22, 300)
>>> c = torch.ones(15, 300)
>>> pad_sequence([a, b, c]).size()
torch.Size([25, 3, 300])

See the PyTorch documentation here.

score -1 · Answer 3 · answered Aug 13 '17 at 21:50

You have to define the max_length of your sequences. You can then check whether each input is shorter than that maximum and pad it with zero vectors. More info here: https://danijar.com/variable-sequence-lengths-in-tensorflow/. In your data generator, you would check every input feature vector and do the following:

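import numpy as np

# feature_vec has shape (timesteps, features)
len_vec = feature_vec.shape[0]
if len_vec < max_length:
    # append rows of zeros until the sequence has max_length timesteps
    mis_dim = max_length - len_vec
    zero_vec = np.zeros((mis_dim, feature_vec.shape[1]))
    feature_vec = np.vstack((feature_vec, zero_vec))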
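Applied to a whole batch, the same idea might look like the sketch below. pad_batch is a hypothetical helper, not a library function; it also returns the true lengths, which are what a sequence_length-style argument would need, and it truncates anything longer than max_length (an assumption made for the example).

```python
import numpy as np

def pad_batch(feature_vecs, max_length):
    """Zero-pad each (timesteps, features) array to max_length rows and stack them."""
    num_features = feature_vecs[0].shape[1]
    batch = np.zeros((len(feature_vecs), max_length, num_features), dtype=np.float32)
    lengths = np.zeros(len(feature_vecs), dtype=np.int32)
    for i, vec in enumerate(feature_vecs):
        length = min(vec.shape[0], max_length)  # truncate anything longer
        batch[i, :length] = vec[:length]
        lengths[i] = length
    return batch, lengths

# Three toy examples with 2, 4 and 1 timesteps and 3 features each.
examples = [np.ones((2, 3)), np.ones((4, 3)), np.ones((1, 3))]
batch, lengths = pad_batch(examples, max_length=4)
print(batch.shape, lengths)
```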

Thanks for the response; however, you are not answering the question. I know how padding works. The point is: can we pad an entire batch of sequences with 0 vectors? - user1935724, Aug 13 '17 at 23:14
