As of 2022, two methods you can use to pad sequences in TensorFlow are batching with a tf.data.Dataset pipeline and preprocessing with tf.keras.utils.pad_sequences.
Method 1: Use TensorFlow Pipelines (tf.data.Dataset)
The padded_batch() method can be used in place of a normal batch() method to pad the elements of a tf.data.Dataset object when batching for model training: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch
The 'batching tensors with padding' pipeline is also described here: https://www.tensorflow.org/guide/data#batching_tensors_with_padding
The call signature is:
padded_batch(
    batch_size,
    padded_shapes=None,
    padding_values=None,
    drop_remainder=False,
    name=None
)
An example for your use case of inputting to an RNN is:
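import tensorflow as tf

# input is a ragged tensor of sequences with different lengths
inputs = tf.ragged.constant([[1], [2, 3], [4, 5, 6]], dtype=tf.float32)
# construct dataset using tf.data.Dataset (one sequence per element)
dataset = tf.data.Dataset.from_tensor_slices(inputs)
# pass each element through an identity map so it comes out as a dense
# tensor; this avoids the TypeError padded_batch raises on ragged elements
dataset = dataset.map(lambda x: x)
# pad sequences to the longest length in the batch using padded_batch
dataset = dataset.padded_batch(3)
# take the single padded batch, shape (3, 3) = (batch, timesteps)
batch = next(iter(dataset))
# add a trailing feature axis: SimpleRNN expects (batch, timesteps, features)
batch = tf.expand_dims(batch, axis=-1)
# run the batch through a simple RNN model
simple_rnn = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(4)
])
output = simple_rnn(batch)  # shape (3, 4)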
Note that padded_batch() always post-pads; it has no option for pre-padding. You can, however, use the padded_shapes argument to pad to a specific sequence length.
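To make the padded_shapes point concrete, here is a minimal sketch. It builds the dataset with from_generator instead of a ragged tensor, purely so the elements are already dense; the names SEQUENCES and make_dataset are illustrative, not part of the API. The second half shows one possible workaround for pre-padding (reverse each sequence, post-pad, reverse the batch back). That is an assumption about how you might emulate it, not a built-in option of padded_batch().

```python
import tensorflow as tf

# Toy sequences of lengths 1, 2 and 3 (illustrative data).
SEQUENCES = [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]]


def make_dataset():
    """Build a dataset of dense 1-D float tensors with varying lengths."""
    return tf.data.Dataset.from_generator(
        lambda: iter(SEQUENCES),
        output_signature=tf.TensorSpec(shape=(None,), dtype=tf.float32),
    )


# 1) Pad every sequence to a fixed length of 5, filling with -1 instead of 0.
fixed = make_dataset().padded_batch(
    3,
    padded_shapes=[5],
    padding_values=tf.constant(-1.0, dtype=tf.float32),
)
fixed_batch = next(iter(fixed))  # shape (3, 5)

# 2) Workaround sketch for pre-padding: reverse each sequence along time,
#    let padded_batch post-pad with zeros, then reverse the batch back.
pre = (
    make_dataset()
    .map(lambda x: tf.reverse(x, axis=[0]))
    .padded_batch(3)
    .map(lambda b: tf.reverse(b, axis=[1]))
)
pre_batch = next(iter(pre))  # shape (3, 3), zeros at the front

print(fixed_batch.numpy())
print(pre_batch.numpy())
```

With padded_shapes=[5] every batch has the same time dimension regardless of its longest sequence, which helps if the model was built with a fixed input length. Sequences longer than the requested shape are not truncated by padded_batch(), so truncate them in a map() step first if that can happen.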
Method 2: Preprocess sequences as a nested list using Keras pad_sequences
Keras (a package sitting on top of TensorFlow since version 2.0) provides a utility function to truncate and pad Python lists to a common length: https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences
The call signature is:
tf.keras.utils.pad_sequences(
    sequences,
    maxlen=None,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)
From the documentation:
This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D Numpy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence in the list.
Sequences that are shorter than num_timesteps are padded with value until they are num_timesteps long.
Sequences longer than num_timesteps are truncated so that they fit the desired length.
The position where padding or truncation happens is determined by the arguments padding and truncating, respectively. Pre-padding or removing values from the beginning of the sequence is the default.
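The effect of the padding, truncating and maxlen arguments is easiest to see side by side. The following is a small sketch on toy data; the variable names are only for illustration.

```python
import tensorflow as tf

sequences = [[1], [2, 3], [4, 5, 6]]

# Defaults: pad at the front up to the longest sequence (length 3).
pre = tf.keras.utils.pad_sequences(sequences)

# Pad at the end instead.
post = tf.keras.utils.pad_sequences(sequences, padding="post")

# maxlen=2 with the default truncating='pre' drops values from the start
# of sequences that are too long, so [4, 5, 6] becomes [5, 6].
trunc_pre = tf.keras.utils.pad_sequences(sequences, maxlen=2)

# truncating='post' drops values from the end, so [4, 5, 6] becomes [4, 5].
trunc_post = tf.keras.utils.pad_sequences(
    sequences, maxlen=2, truncating="post"
)

print(pre)         # [[0 0 1] [0 2 3] [4 5 6]]
print(post)        # [[1 0 0] [2 3 0] [4 5 6]]
print(trunc_pre)   # [[0 1] [2 3] [5 6]]
print(trunc_post)  # [[0 1] [2 3] [4 5]]
```

Note that padding and truncating are independent: the last call still pre-pads the short sequence [1] while truncating the long one from the end.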
An example for your use case of inputting to an RNN:
import tensorflow as tf
import numpy as np

# inputs is a list of varying-length sequences; batch size (list length) is 3
inputs = [[1], [2, 3], [4, 5, 6]]
# pad the sequences with 0's using pre-padding (default values); shape (3, 3)
inputs = tf.keras.utils.pad_sequences(inputs, dtype=np.float32)
# add a trailing feature dimension: SimpleRNN expects (batch, timesteps, features)
inputs = tf.expand_dims(inputs, axis=-1)
# run the batch through a simple RNN layer
simple_rnn = tf.keras.layers.SimpleRNN(4)
output = simple_rnn(inputs)  # shape (3, 4)