
Should we include the bias parameter in Conv2d if we are going for Conv2d followed by ReLU followed by batch norm (bn)?

There is no need if we go for Conv2d followed by bn followed by ReLU, since the shift (beta) parameter of bn does the bias's job.
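For concreteness, a minimal PyTorch sketch of that second ordering (the channel sizes below are just illustrative, not from the question):

```python
import torch.nn as nn

# Conv2d -> BatchNorm2d -> ReLU: the convolution's bias is turned off,
# since BatchNorm2d's learnable shift (beta) can play the same role.
block = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```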

Ash
Venkataraman
  • @jww To be fair, this is at worst a borderline fit, as it is an algorithmic question and it does ask about a development choice. It literally checks two of the four bullet points at the beginning of the "What topics can I ask about?" page. – Ash Dec 20 '19 at 07:12

1 Answer


Yes, if the order is conv2d -> ReLU -> BatchNorm, then having a bias parameter in the convolution can help. To show that, let's assume that there is a bias in the convolution layer, and let's compare what happens with both of the orders you mention in the question. The idea is to see whether the bias is useful for each case.

Let's consider a single pixel from one of the convolution's output layers, and assume that x_1, ..., x_k are the corresponding inputs (in vectorised form) from the batch (batch size == k). We can write the convolution as

Wx + b   # with W the convolution weights and b the bias

As you said in the question, when the order is conv2d -> BN -> ReLU, then the bias is not useful because all it does to the distribution of Wx is shift it by b, and this is cancelled out by the BN layer that immediately follows:

(Wx_i - mu)/sigma  ==>  (Wx_i + b - (mu + b))/sigma, i.e. no change.
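A quick numerical sanity check of this cancellation (a sketch with plain tensors standing in for the convolution's outputs; the shapes and bias value are arbitrary):

```python
import torch

torch.manual_seed(0)
Wx = torch.randn(8, 16)   # pre-activation outputs: batch of 8, 16 channels
b = 0.7                   # a constant bias

def batchnorm(z, eps=1e-5):
    # per-channel normalisation over the batch, as BN does at training time
    return (z - z.mean(dim=0)) / (z.var(dim=0, unbiased=False) + eps).sqrt()

# The shift by b is removed by the normalisation, so both give the same result.
print(torch.allclose(batchnorm(Wx), batchnorm(Wx + b), atol=1e-5))   # True
```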

However, if you use the other order, i.e

BN(ReLU(Wx+b))

then ReLU will map some of the Wx_i + b to 0 (say m of the k values stay positive). As a consequence, the mean will look like this:

(1/k)(0 + ... + 0 + SUM_s (Wx_s + b)) = some_term + (m/k)*b

and the std will look like

const * ((0 - some_term - (m/k)*b)^2 + ... + (Wx_i + b - some_term - (m/k)*b)^2 + ...)

and as you can see by expanding those terms that depend on a non-zero Wx_i + b:

(Wx_i + b - some_term - (m/k)*b)^2 = some_other_terms + some_factor * b * Wx_i   (with some_factor = 2*(1 - m/k))

which means that the result depends on b in a multiplicative manner (b multiplies Wx_i). As a result, its absence can't just be compensated for by the shift component of the BN layer (noted beta in most implementations and papers). That is why having a bias term is not useless when using this order.
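A quick numerical check of this multiplicative dependence, in the same spirit as the sketch above (again not part of the original answer): with ReLU applied before the normalisation, changing b changes the normalised output differently for each sample, so no per-channel shift (beta) can reproduce its effect.

```python
import torch

torch.manual_seed(0)
Wx = torch.randn(8, 16)   # pre-activation outputs: batch of 8, 16 channels

def batchnorm(z, eps=1e-5):
    return (z - z.mean(dim=0)) / (z.var(dim=0, unbiased=False) + eps).sqrt()

out_without_bias = batchnorm(torch.relu(Wx))          # b = 0
out_with_bias    = batchnorm(torch.relu(Wx + 0.7))    # b = 0.7

# If the bias's effect were only a per-channel shift, this difference would be
# constant across the batch for each channel; its spread shows it is not.
diff = out_with_bias - out_without_bias
print(diff.std(dim=0).max())   # clearly non-zero
```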

Ash
  • Please close off-topic questions. – jww Dec 20 '19 at 02:56
  • Great answer, thanks a lot :). Just a question about the 1st case, `conv2d -> BN -> ReLU`: would the bias have an influence at *inference time*, when the mean and std are not based on the current batch statistics? In that situation we would have `(Wx_i + b - mu_{training})/sigma_{training}`. My gut feeling tells me this may cause wrong/unexpected results if the biases have not been disabled nor initialised to 0. – Javier TG Jun 28 '22 at 10:44