I am trying to implement a Caffe Softmax layer with a "temperature" parameter, as part of a network that uses the distillation technique outlined here.
Essentially, I would like my Softmax layer to utilize the Softmax w/ temperature function as follows:
F(X)_i = exp(z_i(X) / T) / sum_l exp(z_l(X) / T)
Using this, I want to be able to tweak the temperature T before training. I found a similar question, but that question applies Softmax with temperature only to the deploy network. I am struggling to implement the additional Scale layer described as "option 4" in the first answer.
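To make sure I understand the formula itself, here is a small NumPy sketch I used as a sanity check (the logit values are made up, not taken from my network) showing that a larger T should flatten the distribution:

import numpy as np

def softmax_with_temperature(z, T):
    """F(X)_i = exp(z_i / T) / sum_l exp(z_l / T)"""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 5.0, 1.0, 2.0]                    # made-up ip1 outputs for one image
print(softmax_with_temperature(logits, T=1.0))   # sharp, close to one-hot
print(softmax_with_temperature(logits, T=40.0))  # much softer / more even

So my expectation is that with T = 40 the trained network's output distribution should look much more even than plain softmax.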
I am using the cifar10_full_train_test prototxt file found in Caffe's examples directory. I have tried making the following change:
Original
...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}
Modified
...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "temperature"
  type: "Scale"
  bottom: "ip1"
  top: "zi/T"
  scale_param {
    filler { type: "constant" value: 0.025 }  ### I wanted T = 40, so 1/40 = 0.025
  }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}
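In case it matters, here is roughly how I would sanity-check in pycaffe that the "temperature" Scale layer really multiplies ip1 by 0.025 (the prototxt and snapshot file names are just placeholders for my local files):

import numpy as np
import caffe

caffe.set_mode_cpu()
# placeholder paths -- my train/test prototxt and a snapshot from the 5,000-iteration run
net = caffe.Net('cifar10_full_train_test.prototxt',
                'cifar10_full_iter_5000.caffemodel',
                caffe.TEST)
net.forward()                          # run one TEST batch through the net

ip1  = net.blobs['ip1'].data           # raw logits
zi_T = net.blobs['zi/T'].data          # output of the "temperature" Scale layer
print(np.allclose(zi_T, ip1 * 0.025))  # expect True if the layer scales by 1/40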
After a quick training run (5,000 iterations), I checked whether my classification probabilities looked softer (more even), roughly as in the sketch after the example below, but they actually came out less evenly distributed.
Example:
high temp T: F(X) = [0.2, 0.5, 0.1, 0.2]
low temp T: F(X) = [0.02, 0.95, 0.01, 0.02]
my attempt: F(X) ≈ [0.0, 1.0, 0.0, 0.0]
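For reference, this is roughly how I inspected those probabilities; the softmax is computed outside Caffe from the ip1 logits, and the file names are again placeholders for my local files:

import numpy as np
import caffe

caffe.set_mode_cpu()
net = caffe.Net('cifar10_full_train_test.prototxt',     # placeholder paths
                'cifar10_full_iter_5000.caffemodel',
                caffe.TEST)
net.forward()

logits = net.blobs['ip1'].data                          # shape: (batch_size, 10)
e = np.exp(logits - logits.max(axis=1, keepdims=True))  # plain softmax, i.e. T = 1
probs = e / e.sum(axis=1, keepdims=True)
print(probs[:5].round(3))                               # first 5 rows of class probabilities; these came out nearly one-hot for me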
Do I appear to be on the right track with this implementation? Either way, what am I missing?