I am trying to understand the reason behind the answer to this question. I was expecting the number of parameters to be:
total_params = (filter_height * filter_width + 1) * number_of_filters
But apparently the product filter_height * filter_width also has to be multiplied by the number of input channels. Why is this? Isn't there parameter sharing across the channel dimension? If each input channel really gets its own set of weights, how does this help with feature recognition?
I would expect a CNN to be able to infer relationships between channels, but I haven't seen how this is explicitly done.
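To make the discrepancy concrete, here is a minimal sketch comparing the two counts. The function names and the example sizes (3x3 filters, 3 input channels, 16 filters) are hypothetical and chosen only for illustration. It assumes a standard 2D convolution layer with one bias per filter.

```python
def expected_params(filter_height, filter_width, number_of_filters):
    """The count I expected: one 2D kernel plus one bias per filter."""
    return (filter_height * filter_width + 1) * number_of_filters


def actual_params(filter_height, filter_width, input_channels, number_of_filters):
    """The count with the channel factor: each filter spans all input
    channels, so it holds filter_height * filter_width * input_channels
    weights plus one bias."""
    return (filter_height * filter_width * input_channels + 1) * number_of_filters


# Hypothetical example: 3x3 filters, 3 input channels (e.g. RGB), 16 filters.
print(expected_params(3, 3, 16))    # (3*3 + 1) * 16 = 160
print(actual_params(3, 3, 3, 16))   # (3*3*3 + 1) * 16 = 448
```

In the second count, the weights are still shared across spatial positions. Only the channel dimension gets separate weights, and that factor is the part I am asking about.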