As you can see in the tutorial, the model is defined something like this:
model = tf.keras.Sequential([
    layers.Embedding(max_features + 1, embedding_dim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1)])
The dataset used in that tutorial is for binary classification, with labels 0 and 1. By defining no activation on the last layer, the original author gets the logits rather than probabilities, which is why the loss function is set up as
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
...
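To see concretely what from_logits means, here is a minimal sketch (made-up numbers): feeding raw logits to BinaryCrossentropy(from_logits=True) gives the same loss as feeding their sigmoid to BinaryCrossentropy(from_logits=False).

import tensorflow as tf

# Two samples with made-up raw scores, as a Dense(1) layer would emit them.
y_true = tf.constant([[1.0], [0.0]])
logits = tf.constant([[0.8], [-1.2]])

loss_a = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logits)
loss_b = tf.keras.losses.BinaryCrossentropy(from_logits=False)(y_true, tf.sigmoid(logits))

print(loss_a.numpy(), loss_b.numpy())  # both ~0.3172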
Now, if we set the last layer's activation to sigmoid (the usual pick for binary classification), then we must set from_logits=False. So, here are the two options to choose from:
With logits: from_logits=True

We take the raw logits from the last layer, which is why we set from_logits=True.
import tensorflow as tf
from tensorflow.keras import layers, losses

# max_features, embedding_dim, train_ds, val_ds, and epochs come from
# the tutorial's data-preparation steps.
model = tf.keras.Sequential([
    layers.Embedding(max_features + 1, embedding_dim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation=None)])  # no activation: the model outputs logits

model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(
    train_ds, verbose=2,
    validation_data=val_ds,
    epochs=epochs)
Epoch 1/3
7ms/step - loss: 0.6828 - accuracy: 0.5054 - val_loss: 0.6148 - val_accuracy: 0.5452
Epoch 2/3
7ms/step - loss: 0.5797 - accuracy: 0.6153 - val_loss: 0.4976 - val_accuracy: 0.7406
Epoch 3/3
7ms/step - loss: 0.4664 - accuracy: 0.7734 - val_loss: 0.4197 - val_accuracy: 0.8096
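One practical consequence of this option: model.predict now returns unbounded raw scores, so to get probabilities at inference time we have to apply the sigmoid ourselves. A minimal sketch, using the model trained above:

raw_scores = model.predict(val_ds)       # unbounded reals, shape (num_examples, 1)
probs = tf.sigmoid(raw_scores).numpy()   # squashed into (0, 1)
preds = (probs > 0.5).astype('int32')    # 0/1 class labels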
Without logits: from_logits=False

Here we take the probability from the last layer, which is why we set from_logits=False.
model = tf.keras.Sequential([
    layers.Embedding(max_features + 1, embedding_dim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')])  # sigmoid: the model outputs probabilities

model.compile(loss=losses.BinaryCrossentropy(from_logits=False),
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(
    train_ds, verbose=2,
    validation_data=val_ds,
    epochs=epochs)
Epoch 1/3
8ms/step - loss: 0.6818 - accuracy: 0.6163 - val_loss: 0.6135 - val_accuracy: 0.7736
Epoch 2/3
7ms/step - loss: 0.5787 - accuracy: 0.7871 - val_loss: 0.4973 - val_accuracy: 0.8226
Epoch 3/3
8ms/step - loss: 0.4650 - accuracy: 0.8365 - val_loss: 0.4195 - val_accuracy: 0.8472
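With this option, model.predict already returns probabilities, so no extra step is needed:

probs = model.predict(val_ds)            # already in (0, 1)
preds = (probs > 0.5).astype('int32')    # 0/1 class labels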
Now, you may wonder why this tutorial uses logits (that is, no activation on the last layer). The short answer is that it generally doesn't matter; we can choose either option. The catch is that there is a chance of numerical instability when using from_logits=False. Check this answer for more details.
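As a quick illustration of that instability (made-up numbers): with an extreme logit, the sigmoid saturates near 0, and the probability path loses precision while the logits path stays exact.

import tensorflow as tf

y_true = tf.constant([[1.0]])
logit = tf.constant([[-20.0]])  # a very confident, very wrong raw score

# Stable: cross-entropy computed directly from the logit.
loss_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logit)

# Less stable: sigmoid(-20) is ~2e-9, which Keras clips to its epsilon (1e-7),
# so the loss saturates and no longer reflects the true value.
prob = tf.sigmoid(logit)
loss_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)(y_true, prob)

print(loss_logits.numpy())  # ~20.0 (exact)
print(loss_probs.numpy())   # ~16.12, i.e. -log(1e-7), capped by clipping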