
I am attempting to implement a Caffe Softmax layer with a "temperature" parameter, as part of a network that uses the distillation technique outlined here.

Essentially, I would like my Softmax layer to use the softmax-with-temperature function:

F_i(X) = exp(z_i(X) / T) / sum_l exp(z_l(X) / T)
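
To see what this function does to the outputs, here is a minimal NumPy sketch (not Caffe code) of softmax with temperature; the logits below are made up purely for illustration:

import numpy as np

def softmax_with_temperature(z, T=1.0):
    # F_i(X) = exp(z_i/T) / sum_l exp(z_l/T), computed with the usual
    # max-subtraction trick for numerical stability.
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 5.0, 1.0, 2.0]                     # made-up logits
print(softmax_with_temperature(logits, T=1.0))    # sharp, nearly one-hot
print(softmax_with_temperature(logits, T=40.0))   # much flatter distribution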

Using this, I want to be able to set the temperature T before training. I found a similar question, but that question deals with applying Softmax with temperature to the deploy network only. I am struggling to implement the additional Scale layer described as "option 4" in the first answer there.

I am using the cifar10_full_train_test prototxt file found in Caffe's examples directory. I have tried making the following change:

Original

...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}

Modified

...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  type: "Scale"
  name: "temperature"
  top: "zi/T"
  bottom: "ip1"
  scale_param {
    filler: { type: 'constant' value: 0.025 } ### I wanted T = 40, so 1/40=.025
  }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}

After a quick training run (5,000 iterations), I checked whether my classification probabilities were more evenly distributed, but they actually appeared to be less evenly distributed.

Example:

high temp T: F(X) = [0.2, 0.5, 0.1, 0.2]

low temp T: F(X) = [0.02, 0.95, 0.01, 0.02]

my attempt: F(X) ≈ [0.0, 1.0, 0.0, 0.0]


Do I appear to be on the right track with this implementation? Either way, what am I missing?

Mink

2 Answers


You are not using the "cooled" predictions "zi/T" that your "Scale" layer produces.

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "zi/T"  # Use the "cooled" predictions instead of the originals.
  bottom: "label"
  top: "loss"
}
Shai
  • thanks for pointing that out! However, my outputs are still not what I want. When deploying the corrected network, my top-1 class predictions for each image in my test set all have a probability of 1.0, with the rest of the classes obviously having 0.0. Is my implementation of the Scale layer doing what I think it is doing? I would like my probability vector to be more "evenly distributed" between classes. – Mink Jul 20 '17 at 14:41
  • 1
  • @Mink the prototxt you showed here does not output class probabilities, but rather a **scalar** loss. You probably have a corresponding `"Softmax"` layer (not `"SoftmaxWithLoss"`). In that case, make sure the `"bottom"` of this `"Softmax"` layer is also `"zi/T"` and not `"ip1"`. – Shai Jul 20 '17 at 14:45

The accepted answer helped me understand my misconceptions about the Softmax temperature implementation.

As @Shai pointed out, in order to observe the "cooled" probability outputs as I was expecting, the Scale layer must only be added to the "deploy" prototxt file. It is not necessary to include the Scale layer in the train/val prototxt at all. In other words, the temperature must be applied to the Softmax layer, not the SoftmaxWithLoss layer.

If you want to apply the "cooled" effect to your probability vector, simply make sure your last two layers are as follows:

deploy.prototxt

layer {
  type: "Scale"
  name: "temperature"
  top: "zi/T"
  bottom: "ip1"
  scale_param {
    filler: { type: 'constant' value: 1/T } ## Replace "1/T" with actual 1/T value
  }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  name: "prob"
  type: "Softmax"
  bottom: "zi/T"
  top: "prob"
}
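
If it helps, here is a rough pycaffe sketch of how you could sanity-check the softened output at deploy time. The weights file name and the dummy input are placeholders, and it assumes the input blob is called "data" as in the CIFAR-10 example nets:

import numpy as np
import caffe

# Placeholder file names; point these at your own deploy prototxt and weights.
net = caffe.Net('deploy.prototxt', 'cifar10_full_iter_5000.caffemodel', caffe.TEST)

# Dummy CIFAR-10-sized input just to exercise the forward pass.
net.blobs['data'].data[...] = np.random.rand(1, 3, 32, 32)
out = net.forward()

# With the Scale layer in place, "prob" should be noticeably softer than with T = 1.
print(out['prob'])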

My confusion was due primarily to my misunderstanding of the difference between SoftmaxWithLoss and Softmax.

Mink