
I have designed a simple ConvNet in Keras that classifies images into two classes. The images are not real-world photos but signals converted to images. The model trains well, reaching 99% training accuracy and 90% test accuracy.

The model summary is as follows:

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_5 (Conv2D)            (None, 64, 64, 32)        832       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 32, 32, 32)        25632     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 16, 16, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 16, 16, 64)        51264     
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 8, 8, 64)          0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 5, 5, 64)          65600     
_________________________________________________________________
flatten_2 (Flatten)          (None, 1600)              0         
_________________________________________________________________
dense_12 (Dense)             (None, 256)               409856    
_________________________________________________________________
dense_13 (Dense)             (None, 32)                8224      
_________________________________________________________________
dense_14 (Dense)             (None, 2)                 66        
=================================================================
Total params: 561,474
Trainable params: 561,474
Non-trainable params: 0
_________________________________________________________________
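
For reference, an architecture matching this summary can be reconstructed from the output shapes and parameter counts (a sketch: the 5x5 and 4x4 kernel sizes follow from the parameter counts, while the ReLU/softmax activations and the single-channel 64x64 input are assumptions):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # 832 params = 32 * (5*5*1 + 1): 5x5 kernels on a 1-channel input
    Conv2D(32, (5, 5), padding='same', activation='relu',
           input_shape=(64, 64, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (5, 5), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (5, 5), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    # 4x4 'valid' convolution: 8 - 4 + 1 = 5, matching (None, 5, 5, 64)
    Conv2D(64, (4, 4), activation='relu'),
    Flatten(),                          # 5 * 5 * 64 = 1600
    Dense(256, activation='relu'),      # dense_12
    Dense(32, activation='relu'),       # dense_13
    Dense(2, activation='softmax'),     # dense_14
])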

The problem is, when I print out the outputs of the hidden layers dense_12 and dense_13 over my data, many of the dimensions have zero variance. This, in my opinion, means that those outputs have no effect on the decision; in other words, they are useless features that the network ignores. The sample code for obtaining these hidden layer outputs is as follows:

import numpy as np
from keras import backend

# Functions mapping the model input to the activations of the two hidden
# dense layers: layers[-2] is dense_13, layers[-3] is dense_12.
dense_2_output = backend.function([model.input], [model.layers[-2].output])
dense_1_output = backend.function([model.input], [model.layers[-3].output])

# Despite the names, these are layer activations, not weights.
train_weights_2 = dense_2_output([grid_train_image])[0]
train_weights_1 = dense_1_output([grid_train_image])[0]

# Per-unit variance of the activations across the training set.
x_2 = np.var(train_weights_2, 0)
x_1 = np.var(train_weights_1, 0)
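
Counting the dead dimensions is then straightforward (a small sketch; in practice a tolerance such as x_1 < 1e-8 may be safer than comparing against exact zero):

print('dense_12: %d of %d units have zero variance' % ((x_1 == 0).sum(), x_1.size))
print('dense_13: %d of %d units have zero variance' % ((x_2 == 0).sum(), x_2.size))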

Thinking of this model as analogous to a classic SVM, features with zero variance carry no information and should be omitted. Does anyone have an idea how to omit these useless features so as to obtain a fully functional feature space?

Note: I also tried shrinking the dense layers from (256, 32) to (180, 20) units and retrained the system. There are still lots of zero-variance features, and the system's performance was slightly reduced. If you can point me to further reading on this or know the answer, I'd appreciate any help.

  • If you think I have chosen the wrong forum, please tell me. – Hafez Mousavi Jun 26 '19 at 08:01
  • I don't know whether they mean the same thing, but "Pruning in Keras" talks about the network's weights being zero. My weights are not zero; they are a variety of non-zero numbers. It is the outputs of the nodes that are zero across the entire training and test sets (in the hidden layers). – Hafez Mousavi Jun 26 '19 at 10:06

1 Answer

Pruning is, in principle, not supported in Keras. The problem is that you end up either with sparse matrices, which are slower than dense matrix multiplications, or with mask vectors. Moreover, those nodes still contribute to the final result whenever their outputs are non-zero, and each neuron in the following layer weights that contribution differently. Simply removing them will therefore change your results, unless their outputs are actually zero everywhere, which does happen when you use ReLUs.
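
To illustrate that last point: a ReLU unit in dense_12 whose output is exactly zero for every input can be removed without changing any prediction, because its outgoing weights in dense_13 only ever multiply zeros. A rough sketch of pruning such units by hand, reusing the x_1 variances from the question (the layer indices and the rebuild step are assumptions about the questioner's setup):

import numpy as np

# Units of dense_12 whose activation actually varies over the data.
keep = np.where(x_1 > 0)[0]

# Incoming weights of dense_12 and outgoing weights of dense_13.
W1, b1 = model.layers[-3].get_weights()   # W1: (1600, 256), b1: (256,)
W2, b2 = model.layers[-2].get_weights()   # W2: (256, 32),   b2: (32,)

# Slice out the dead units: their rows of W2 only ever multiply zeros.
W1_pruned, b1_pruned = W1[:, keep], b1[keep]
W2_pruned = W2[keep, :]

# A copy of the model with Dense(len(keep)) in place of Dense(256) can now
# be built and given [W1_pruned, b1_pruned] and [W2_pruned, b2] via
# layer.set_weights(); its predictions match the original exactly.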

One thing you could try is regularization, to encourage the network to make use of more of its nodes, e.g. dropout or weight regularization.
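
In Keras that could look like the following (a sketch of the dense tail only; the 0.5 dropout rate and the 1e-4 L2 factor are placeholder values to tune):

from keras.layers import Dense, Dropout
from keras.regularizers import l2

# L2 weight decay discourages the network from concentrating its work in a
# few units; dropout forces redundancy by randomly disabling units in training.
model.add(Dense(256, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))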

Thomas Pinetz