I'm troubleshooting a Keras/TensorFlow U-Net for semantic segmentation. One thing I keep coming across is claims like this:
"Now the problem is using the softmax in your case as Keras don't support softmax on each pixel."
which is stated here (as well as lots of other places): Cross Entropy Loss for Semantic Segmentation Keras
The typical solution is to unroll the rows and cols into one dimension, so that the 4D output tensor of shape (batch, rows, cols, n_classes) becomes a 3D tensor of shape (batch, rows*cols, n_classes), and to apply a dense softmax to that.
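For reference, that workaround looks roughly like this (a minimal sketch of my own, not taken from the linked answer; the layer names and sizes are arbitrary):

from keras.layers import Input, Conv2D, Reshape, Activation
from keras.models import Model

rows, cols, n_classes = 3, 3, 2
inp = Input(shape=(rows, cols, 5))
logits = Conv2D(filters=n_classes, kernel_size=(1, 1), activation='linear')(inp)
# unroll the spatial dims: (batch, rows, cols, n_classes) -> (batch, rows*cols, n_classes)
flat = Reshape((rows * cols, n_classes))(logits)
# Activation('softmax') normalizes over the last axis, i.e. per pixel
out = Activation('softmax')(flat)
workaround_model = Model(inp, out)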
But this dummy example leads me to think that a Conv2D layer with a 1x1 kernel and 'softmax' activation DOES perform a per-pixel softmax.
Am I wrong?
Example:
import numpy as np
np.random.seed(345)
from keras.layers import Input
from keras.models import Model
from keras.layers import Conv2D
from keras.initializers import RandomUniform
# mimic the input to a per-pixel softmax classifier: in this case a 3x3 field with 5 channels
np.random.seed(234)
dummy_input = np.random.random(45).astype('float32')
dummy_input = dummy_input.reshape((1,3,3,5))
# build 2 networks with the same random weights (one linear, one softmax)
conv_input = Input(shape=(3,3,5,), dtype='float32')
pred_layer_softmax = Conv2D(filters=2, kernel_size=(1,1), strides=(1,1),
                            kernel_initializer=RandomUniform(minval=0., maxval=1.),
                            padding='valid', data_format='channels_last',
                            activation='softmax')(conv_input)
pred_layer_linear = Conv2D(filters=2, kernel_size=(1,1), strides=(1,1),
                           padding='valid', data_format='channels_last',
                           activation='linear')(conv_input)
# models
m_softmax = Model(conv_input, pred_layer_softmax)
m_linear = Model(conv_input, pred_layer_linear)
# keep weights the same for both networks
m_linear.set_weights(m_softmax.get_weights())
# predictions
pred = m_softmax.predict(dummy_input)
pred_lin = m_linear.predict(dummy_input)
Do the two networks make the same class predictions? Yes.
print('\nsoftmax pred')
print(pred.argmax(axis=3))
print('\nlinear_pred')
print(pred_lin.argmax(axis=3))
softmax pred
[[[1 1 0]
[1 1 0]
[1 0 1]]]
linear_pred
[[[1 1 0]
[1 1 0]
[1 0 1]]]
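(That agreement is expected regardless: exp is monotonic, and the per-pixel normalization divides all class scores at a pixel by the same constant, so the ordering, and hence the argmax, is preserved. You can assert it directly on the arrays above:)

# argmax over the class axis must match between the logits and the softmax output
assert np.array_equal(pred.argmax(axis=3), pred_lin.argmax(axis=3))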
Do the per-pixel class probabilities sum to 1.0 for the softmax network? Yes.
print('\nsoftmax - sum class probs')
print(pred.sum(axis=3))
print('\nlinear - sum class probs')
print(pred_lin.sum(axis=3))
softmax - sum class probs
[[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]]
linear - sum class probs
[[[ 1.88952112 2.50639653 2.06084657]
[ 1.81122136 2.21819067 2.01038122]
[ 1.92753291 1.85993922 2.27295876]]]
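As one more check, applying a softmax by hand along the class axis of the linear output should reproduce the softmax network's output exactly, if Conv2D's softmax really is per-pixel (a numpy sketch reusing pred and pred_lin from above):

# manual per-pixel softmax over the class axis (axis=3)
e = np.exp(pred_lin - pred_lin.max(axis=3, keepdims=True))  # subtract max for numerical stability
manual_softmax = e / e.sum(axis=3, keepdims=True)
print(np.allclose(pred, manual_softmax, atol=1e-6))  # should print True if the softmax is per-pixel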
What am I missing? This looks like per-pixel softmax is working fine, right?
Is there some fundamental thing I'm not understanding?
Thanks in advance.