
I am trying to train a seq2seq model. The embedding layer in the encoder sometimes outputs NaN values after some iterations, and I cannot identify the reason. How can I solve this? The problem is the first emb_layer in the forward function in the code below.


import torch.nn as nn

# PositionalEncoder and TransformerEncoderBlock are defined elsewhere in my code.

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_size=1024, num_layers=6, dropout=0.2,
                 input_pad=1, batch_first=False, embedder=None, init_weight=0.1):
        super(TransformerEncoder, self).__init__()
        self.input_pad = input_pad
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.embedder = embedder

        # Reuse a shared embedding if one is passed in, otherwise create a new one.
        if embedder is not None:
            self.emb_layer = embedder
        else:
            self.emb_layer = nn.Embedding(vocab_size, hidden_size, padding_idx=1)

        self.positional_encoder = PositionalEncoder()
        self.transformer_layers = nn.ModuleList()
        for _ in range(num_layers):
            self.transformer_layers.append(
                    TransformerEncoderBlock(num_heads=8, embedding_dim=1024, dropout=dropout))

    def set_mask(self, inputs):
        # True at padding positions, shape (batch, 1, seq_len)
        self.input_mask = (inputs == self.input_pad).unsqueeze(1)

    def forward(self, inputs):
        x = self.emb_layer(inputs)       # <- this is where the NaN values appear
        x = self.positional_encoder(x)
        # ... (rest of forward omitted)
kintsuba
  • Please start by identifying which is the corresponding input tensor for which you get the NaN values. Without knowing more about your data, it is nearly impossible to solve your problem by just looking at the code. – dennlinger Sep 18 '19 at 14:30
  • I see. Thank you for your advice. I will follow what you said first. – kintsuba Sep 20 '19 at 06:11
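Following the suggestion in the comments, a minimal way to locate the offending batch is to check the embedding weights and the embedding output for non-finite values each step. This is only a sketch; `encoder` and `batch` are placeholder names for your own encoder instance and input batch:

import torch

# 1) Are the embedding weights themselves already NaN?
if torch.isnan(encoder.emb_layer.weight).any():
    print("NaN in embedding weights")

# 2) Does this particular batch produce NaN at the embedding output?
emb_out = encoder.emb_layer(batch)
if torch.isnan(emb_out).any():
    print("NaN in embedding output for batch:", batch)

# 3) Optionally let autograd pinpoint the op that first produces NaN in backward:
torch.autograd.set_detect_anomaly(True)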

3 Answers


It is usually the inputs, more than the weights, that tend to become NaN (they either go too high or too low). Maybe they are incorrect to start with and get worse after a few gradient updates. You can identify these inputs by running the tensor or np.array through a simple condition check like:

print("Inp value too high") if len(bert_embeddings[bert_embeddings>1000]) > 1 else None

A common beginner mistake is to use `torch.empty` instead of `torch.zeros`. `torch.empty` returns uninitialized memory, so its contents are arbitrary, and this invariably leads to NaN over time.
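For illustration, the difference is easy to see directly (a small sketch, not from the original answer):

import torch

buf_bad = torch.empty(2, 3)   # uninitialized memory: contents are arbitrary, possibly huge
buf_ok = torch.zeros(2, 3)    # always starts at 0.0

print(buf_bad)  # values differ from run to run
print(buf_ok)   # tensor of zeros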

If all your inputs are good, then it is a vanishing or exploding gradients issue. Check whether the problem gets worse after a few iterations. Exploring different activations or clipping gradients usually fixes these types of issues. If you are using a recent optimizer, you usually do not need to worry about adjusting the learning rate.
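Gradient clipping is a one-line addition to the training loop. Here is a minimal sketch, assuming a model, optimizer and loss are already set up (those names are placeholders, not part of the code in the question):

import torch

# ... inside the training loop:
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
optimizer.step()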

Allohvk
    I can confirm that using `torch.empty` in layers will most likely cause your model to output nans after some time. – Kinyugo Dec 14 '21 at 09:07

It looks like some weights become NaN. One possible reason is that on some iteration a layer's output is +-inf. If the output is +-inf on the forward pass, the backward pass will also produce +-inf, and since inf - inf = NaN, the weights become NaN and every following iteration will output NaN.

You can check this by tracking inf outputs from emb_layer, for example with a forward hook like the one sketched below.
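A minimal sketch of such a hook (register_forward_hook is standard PyTorch; `encoder` is a placeholder for an instance of the TransformerEncoder above):

import torch

def check_finite(module, inputs, output):
    # Fires on every forward pass of the hooked module.
    if torch.isinf(output).any() or torch.isnan(output).any():
        print(f"Non-finite output detected in {module.__class__.__name__}")

hook_handle = encoder.emb_layer.register_forward_hook(check_finite)
# ... run training; call hook_handle.remove() when done.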

If this is the reason, just try to avoid functions that may return inf values.

antoleb

My dataset was very small compared to the one in the tutorial I was following, and my embeddings were way too big for the data available, so eventually the NaN propagated through the network. Making my embedding layers smaller (a smaller number of factors / columns in the matrix) solved the NaN problem for me.
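In code, that simply means choosing a smaller embedding dimension when the vocabulary and dataset are small. The sizes below are only an illustration, not taken from the question:

import torch.nn as nn

# A 1024-dim embedding is oversized for a tiny dataset/vocabulary.
emb_big = nn.Embedding(num_embeddings=5000, embedding_dim=1024)

# A smaller embedding dimension is usually a better fit for small data.
emb_small = nn.Embedding(num_embeddings=5000, embedding_dim=128)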

Yunnosch