I'm working on an NLP sequence labelling problem. My data consists of variable-length sequences (w_1, w_2, ..., w_k)
with corresponding labels (l_1, l_2, ..., l_k)
(in this case the task is named entity extraction).
I intend to solve the problem using Recurrent Neural Networks. Since the sequences are of variable length, I need to pad them (I want a batch size > 1). I have the option of either pre-padding or post-padding them with zeros, i.e. I either make every sequence (0, 0, ..., w_1, w_2, ..., w_k)
or (w_1, w_2, ..., w_k, 0, 0, ..., 0),
so that all sequences have the same length.
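To make the two options concrete, here is a minimal sketch in plain Python of what I mean (the `pad_batch` helper and the token IDs are just made up for illustration; as far as I know, Keras's `pad_sequences` exposes the same choice via its `padding='pre'` / `'post'` argument):

```python
def pad_batch(sequences, pad_value=0, mode="pre"):
    """Pad variable-length sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_value] * (max_len - len(seq))
        # pre-padding: zeros before the tokens; post-padding: zeros after
        padded.append(padding + seq if mode == "pre" else seq + padding)
    return padded

batch = [[11, 42, 7], [3, 9]]          # two sequences of token IDs
print(pad_batch(batch, mode="pre"))    # [[11, 42, 7], [0, 3, 9]]
print(pad_batch(batch, mode="post"))   # [[11, 42, 7], [3, 9, 0]]
```

(For the labelling task the label sequences (l_1, ..., l_k) would of course be padded in the same way.)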
How does the choice between pre-padding and post-padding impact results?
It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Given the nature of RNNs, it feels like an arbitrary choice to me, since they share weights across time steps.