According to Andrew Ng, when using TensorFlow for classification it's better to compute the loss from logits, i.e. use `from_logits`. That is, instead of:
```python
model = Sequential([
    ...,
    Dense(units=1, activation='sigmoid')
])
model.compile(..., BinaryCrossentropy())
```
the advice is to use
```python
model = Sequential([
    ...,
    Dense(units=1, activation='linear')
])
model.compile(..., BinaryCrossentropy(from_logits=True))
```
(and similar for multiclass).
As far as I understand, the only reason for doing so is to improve numerical stability.
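To make that concrete, here is a minimal sketch of the stability issue in plain Python (not TensorFlow's actual implementation; the stable formula is the standard log-sum-exp rearrangement that `from_logits=True` losses rely on):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(y, z):
    # sigmoid first, then cross-entropy: 1 - p rounds to 0 for large z
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logits(y, z):
    # stable rearrangement: max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 40.0, 0.0  # a confidently wrong prediction
# sigmoid(40) rounds to exactly 1.0 in float64, so the naive form
# evaluates log(1 - 1.0) = log(0) and raises a math domain error,
# while the logits form returns the finite, correct loss of ~40.0.
```

Both forms agree for moderate logits; they only diverge once `sigmoid(z)` saturates in floating point, which is exactly the regime the rearrangement is there to handle.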
This makes me wonder: why doesn't TensorFlow do this transformation automatically? Surely the `compile` method must be able to see that a sigmoid activation is used for the last layer, replace it with linear, and effectively set `from_logits=True` internally? This would also allow TF to keep a consistent interface, e.g. make `.predict` work as expected.
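(The `.predict` inconsistency I mean is the extra post-processing step the second setup forces on you. A sketch in plain Python, with hypothetical logit values standing in for a real `model.predict` output:)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw outputs of model.predict(...) when the last layer
# is linear (from_logits=True): these are logits, not probabilities.
logits = [-2.0, 0.0, 3.5]

# You have to recover the probabilities yourself; with TensorFlow this
# would be tf.math.sigmoid(model.predict(X)).
probs = [sigmoid(z) for z in logits]
```

With the sigmoid output layer of the first example, `model.predict` would return these probabilities directly.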
Is there any reason why TF would not want to do this? E.g. are there use cases where the first example is preferred over the second? Is there a performance penalty?