In the cs231n 2017 class, when we backpropagate the gradient, we update the biases like this:
db = np.sum(dscores, axis=0, keepdims=True)
What's the basic idea behind the sum operation? Thanks
This is the formula for the derivative (more precisely, the gradient) of the loss function with respect to the bias (see this question and this post for derivation details).
The numpy.sum call computes the per-column sums along axis 0. Example:
dscores = np.array([[1, 2, 3],[2, 3, 4]]) # a 2D matrix
db = np.sum(dscores, axis=0, keepdims=True) # result: [[3 5 7]]
The result is exactly the element-wise sum [1, 2, 3] + [2, 3, 4] = [3 5 7]. In addition, keepdims=True preserves the rank (number of dimensions) of the original matrix, which is why the result is [[3 5 7]] instead of just [3 5 7].
By the way, if we were to compute np.sum(dscores, axis=1, keepdims=True), the result would be [[6] [9]], i.e. a column of per-row sums.
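If you want to check the shapes yourself, a quick sketch reusing the dscores array from the example above might look like this:
import numpy as np
dscores = np.array([[1, 2, 3],
                    [2, 3, 4]])                 # shape (2, 3)
print(np.sum(dscores, axis=0, keepdims=True))   # [[3 5 7]] -> shape (1, 3)
print(np.sum(dscores, axis=0))                  # [3 5 7]   -> shape (3,), rank dropped
print(np.sum(dscores, axis=1, keepdims=True))   # [[6]
                                                #  [9]]     -> shape (2, 1)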
[Update]
Apparently, the focus of this question is the formula itself. I'd like not to go too far off-topic here, so I'll just give the main idea. The sum appears in the formula because of broadcasting over the mini-batch in the forward pass: the same bias is added to the scores of every example. If you take just one example at a time, the bias derivative is simply the error signal, i.e. dscores (the links above explain this in detail). But for a batch of examples the gradients add up due to linearity, which is why we take the sum along the batch axis, axis=0.
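As a sanity check, here is a minimal sketch (the arrays X, W, b and the random dscores below are made-up placeholders, not variables from the course code) showing that the batched db equals the sum of the per-example bias gradients:
import numpy as np
np.random.seed(0)
X = np.random.randn(4, 5)        # mini-batch of 4 examples, 5 features
W = np.random.randn(5, 3)        # weights for 3 classes
b = np.zeros((1, 3))             # bias, broadcast over the batch in the forward pass
scores = X.dot(W) + b            # forward pass: the same b is added to every row
dscores = np.random.randn(4, 3)  # pretend upstream gradient (error signal)
# batched gradient: sum over the batch axis
db = np.sum(dscores, axis=0, keepdims=True)
# per-example gradients: for a single example, db is just that row of dscores
db_per_example = [dscores[i:i+1] for i in range(X.shape[0])]
print(np.allclose(db, sum(db_per_example)))   # True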