This answer says:
If there's a mask in your model, it will be propagated layer by layer and eventually applied to the loss. So if you pad and mask the sequences correctly, the loss on the padding placeholders will be ignored.
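For context, here is the kind of setup I understand that answer to describe (my own minimal sketch, not code from the tutorial): an Embedding layer with mask_zero=True creates the Keras mask, the downstream layers propagate it, and a built-in loss is used unchanged, so the padded timesteps should not contribute to the loss.

import numpy as np
import tensorflow as tf

# Minimal sketch (my own, not from the tutorial): mask_zero=True makes the
# Embedding layer emit a Keras mask, which is propagated through the LSTM and
# Dense layers and applied as a sample weight on the built-in loss during fit.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(100),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Toy padded batch: zeros mark padding positions.
x = np.array([[5, 7, 2, 0, 0], [3, 1, 0, 0, 0]])
y = np.array([[7, 2, 9, 0, 0], [1, 4, 0, 0, 0]])
model.fit(x, y, epochs=1, verbose=0)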
However, in TensorFlow's tutorial on Transformers, the author implements a custom loss and metric in which the masks are computed and applied internally. Is this necessary?
Note that in the code of the Transformer model, the author deletes the Keras mask:
....
....
try:
  # Drop the keras mask, so it doesn't scale the losses/metrics.
  # b/250038731
  del logits._keras_mask
except AttributeError:
  pass

# Return the final output and the attention weights.
return logits
Do we need to implement a custom loss and metric with masking, or can we use the built-in ones?
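For concreteness, by "custom loss and metric with masking" I mean something along these lines (a rough sketch in the spirit of the tutorial, not the exact code): the mask is recomputed from the labels and used to zero out and renormalize the per-token values.

import tensorflow as tf

def masked_loss(label, pred):
    # Sketch of a masked loss: rebuild the mask from the labels (0 = padding)
    # instead of relying on the propagated Keras mask.
    mask = label != 0
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_object(label, pred)

    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask
    # Average only over the non-padding tokens.
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def masked_accuracy(label, pred):
    # Sketch of the matching metric: count correct predictions only on
    # non-padding tokens.
    pred = tf.argmax(pred, axis=2)
    label = tf.cast(label, pred.dtype)
    match = (label == pred) & (label != 0)

    match = tf.cast(match, dtype=tf.float32)
    mask = tf.cast(label != 0, dtype=tf.float32)
    return tf.reduce_sum(match) / tf.reduce_sum(mask)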