
I'm working on an NLP sequence labelling problem. My data consists of variable-length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).

I intend to solve the problem using Recurrent Neural Networks. Since the sequences are of variable length I need to pad them (I want a batch size > 1). I can either pre-pad them with zeros or post-pad them with zeros, i.e. I make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0) so that all sequences have the same length.
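To make the two options concrete, here is a minimal sketch of what I mean, using Keras's `pad_sequences` on some made-up token-id sequences:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy token-id sequences of different lengths (the ids are arbitrary)
sequences = [[3, 7, 2], [5, 1], [8, 4, 6, 9]]

# Pre-padding: zeros are prepended, so every sequence ends with its real tokens
pre = pad_sequences(sequences, padding="pre")
# [[0 3 7 2]
#  [0 0 5 1]
#  [8 4 6 9]]

# Post-padding: zeros are appended after the real tokens
post = pad_sequences(sequences, padding="post")
# [[3 7 2 0]
#  [5 1 0 0]
#  [8 4 6 9]]
```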

How does the choice between pre- and post-padding impact results?

Pre-padding seems to be more common, but I can't find an explanation of why it would be better. Given the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.

langkilde

2 Answers


Commonly with RNNs, we take the final output or hidden state and use it to make a prediction (or to do whatever task we are trying to do).

If we feed the RNN a run of zeros before taking the final output (i.e. post-padding as you describe), then the hidden state the network has built up by the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after that word.

So intuitively, this might be why pre-padding is more popular/effective.
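Here is a quick sketch of the effect I mean (PyTorch, with made-up sizes): with post-padding, the output at the last timestep has already absorbed the trailing zero inputs, whereas the output gathered at each sequence's true last token has not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy post-padded batch; all names and sizes here are made up
batch, max_len, emb_dim, hidden_dim = 2, 5, 8, 16
lengths = torch.tensor([5, 3])             # true lengths before padding
x = torch.randn(batch, max_len, emb_dim)
x[1, 3:] = 0.0                             # zero out the padded tail of the second sequence

rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
outputs, _ = rnn(x)                        # (batch, max_len, hidden_dim)

# Naive "final" output: for the second sequence this has already seen two zero inputs
naive_last = outputs[:, -1, :]

# Output at each sequence's true last token, untouched by the padding tail
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, hidden_dim)
true_last = outputs.gather(1, idx).squeeze(1)

print((naive_last[1] - true_last[1]).abs().max())  # typically noticeably non-zero
```

Using something like PyTorch's pack_padded_sequence avoids this bookkeeping entirely, since the padded timesteps never enter the recurrence at all.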

nlml
  • By the same logic, wouldn't the series of padding tokens at the start of the sequence essentially get you into a 'zero state' before you start encountering words, so that you can't actually learn anything because of the zeroing effect of pre-padding on the state? It seems like it would cause the same problem. – ely Sep 26 '19 at 13:23
  • You make a good point. Probably better to use something like Pytorch's PackedSequence when dealing with variable sequence lengths https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch – nlml May 23 '20 at 04:43
  • Doesn't this answer ignore the fact that masking is a thing? I would love to see some peer-reviewed (ideally large-scale) study that shows actual benefits of one approach vs the other. In addition, TF and Keras have post as the default and [recommend it](https://www.tensorflow.org/guide/keras/masking_and_padding#padding_sequence_data), cuDNN does post-padding, and Huggingface Transformers only supports post-padding, so I would really challenge that pre-padding is more popular. Pytorch [doesn't even support](https://github.com/pytorch/pytorch/issues/10536) pre-padding right now. – runDOSrun Feb 26 '21 at 10:39
  • My model doesn't learn anything at all with post-padding. It outputs the same value for all samples. After the laborious process of tweaking all other hyperparameters, I found that changing the padding type to pre-padding fixes it. This is very weird, since mask_zero is set to True. I didn't expect the padding type to have such drastic effects. – ProteinGuy Mar 24 '21 at 03:36

This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding type on LSTMs and CNNs. The authors found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).

A simple/intuitive explanation for RNNs is that post-padding adds noise to what has been learned from the sequence through time, and there are no further timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN can adjust to the noise of the zeros at the beginning as it then learns from the actual sequence through time.

I think more thorough experiments are needed in the community before we have a detailed mechanistic explanation of how padding affects performance.

I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
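For what it's worth, here is roughly the kind of setup the question describes, as a minimal Keras sketch (the vocabulary size, dimensions, and label count are placeholders I made up). With mask_zero=True the padded timesteps should be masked regardless of which side they are on, yet as my comment on the other answer shows, the padding side can still have a surprising effect in practice:

```python
from tensorflow.keras import layers, models

vocab_size, emb_dim, num_labels = 10_000, 64, 9    # placeholder sizes

model = models.Sequential([
    layers.Input(shape=(None,), dtype="int32"),    # variable-length sequences of token ids
    # id 0 is reserved for padding; mask_zero=True lets downstream layers skip those steps
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # one label distribution per timestep, as in NER-style sequence labelling
    layers.TimeDistributed(layers.Dense(num_labels, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```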

ProteinGuy