A similar question is: how many pixels are covered by each activation? That is essentially the same as asking how large an input image has to be in order to produce exactly one activation in a given layer.
Say the filter size and stride of layer i are ki and si, and the size of the input is x*x. Then we have ((((x - k1 + 1)/s1 - k2 + 1)/s2 ...) - kn + 1)/sn = 1, and x can be solved for easily.
The original question is equivalent to asking how large an input image has to be in order to produce exactly one activation in a layer, without considering the stride of the last layer.
So the answer is the solution of the equation above with the final division by sn dropped, which can be computed by the following pseudocode:
    x = layer[n].k
    for i = n-1 down to 1:
        x = x*layer[i].s + layer[i].k - 1
The total number of pixels is then x*x.
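As a runnable version of the pseudocode above, here is a minimal Python sketch. The `Layer` tuple and the function name `input_size_1d` are made up for illustration, and the sample layer values (filter sizes 5, 3, 5 with strides 1 and 3) are assumed only to have something concrete to run.

```python
from typing import NamedTuple, Sequence


class Layer(NamedTuple):
    k: int  # filter (kernel) size along one dimension
    s: int  # stride along one dimension


def input_size_1d(layers: Sequence[Layer]) -> int:
    """Walk backwards from the last layer, ignoring the last layer's stride."""
    x = layers[-1].k
    for layer in reversed(layers[:-1]):
        x = x * layer.s + layer.k - 1
    return x


# Hypothetical network: 5x5 filter (stride 1), 3x3 pooling (stride 3), 5x5 filter.
# The stride of the last layer is never used, so its value here is arbitrary.
layers = [Layer(k=5, s=1), Layer(k=3, s=3), Layer(k=5, s=1)]

x = input_size_1d(layers)
print(x, x * x)  # 21 441
```

Passing a prefix such as `layers[:2]` gives the size for an intermediate layer with the same recurrence.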
In your example, the sum_1d for the first layer is 5, for the second layer it is 5*1+3-1=7, and for the third it is 5*3+2+4=21 (I'm assuming the pooling layer is non-overlapping, s=3).
You can verify this by working forward: if the input is 21*21, after the first layer it is 17*17, and after pooling it is (17-2)/3=5 (actually, 16*16 and 15*15 would give the same result), which fits exactly into one filter of the third layer.
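The forward check can also be scripted. This is only a sketch: it uses the common no-padding output-size formula floor((x - k)/s) + 1, which is an assumption on my part and is slightly different from the simplified (x - k + 1)/s used above, although both give 17 and 5 for this example. The helper names `output_size` and `forward` are hypothetical.

```python
def output_size(x: int, k: int, s: int) -> int:
    """Output size of one layer without padding: floor((x - k) / s) + 1."""
    return (x - k) // s + 1


def forward(x: int, layers):
    """Return the 1-D size of the input and of every layer's output."""
    sizes = [x]
    for k, s in layers:
        x = output_size(x, k, s)
        sizes.append(x)
    return sizes


# (k, s) per layer, read off the example above. The stride of the last
# layer is not given; s=1 is assumed and does not affect the result here.
layers = [(5, 1), (3, 3), (5, 1)]

print(forward(21, layers))  # [21, 17, 5, 1]
```

Under this formula, inputs of 20 and 19 (first-layer outputs of 16 and 15) also end in a single activation, which matches the parenthetical remark above.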