Tensorflow U-Net Multiclass Label

Question

I'm new to Stack Overflow, so please excuse any typical newbie mistakes.

I want to set up a CNN with a U-Net architecture in Python and TensorFlow. I tried to reuse some code that works for binary classification and wanted to adapt it to detect 3 classes. The code works well with 2 output channels and a binary image as the ground-truth label.

Now my question is: are there any conventions for what multiclass labels should look like? Should I use a label image with only one layer (grayscale) with three values for my different classes (like 0, 127, 255)? Or should I use an RGB image with one colour for every class (like 255, 0, 0 for class 0; 0, 255, 0 for class 1, and so on)?

""" 0) Creating placeholders for input images and labels """
# Placeholder for input images
x = tf.placeholder(tf.float32, [None, 3*img_size]) # None = arbitrary (Number of images)
# Arranging images in 4D format
x_shaped = tf.reshape(x, [-1, img_height, img_width, 3]) # 3 for 3 channels RGB
# Placeholder for labels of input images (ground truth)
y = tf.placeholder(tf.float32, [None, 2*img_size])
# Arranging labels in 4D format
y_shaped = tf.reshape(y, [-1, img_size, 2])


""" 1) Defining FCN-8 VGGNet-16 """
network = conv_layer(x_shaped, 64, filter_size=[3, 3], name='conv1a')
network = conv_layer(network, 64, filter_size=[3, 3], name='conv1b')
network = max_pool_layer(network, name='pool1')

network = conv_layer(network, 128, filter_size=[3, 3], name='conv2a')
network = conv_layer(network, 128, filter_size=[3, 3], name='conv2b')
network = max_pool_layer(network, name='pool2')

network = conv_layer(network, 256, filter_size=[3, 3], name='conv3a')
network = conv_layer(network, 256, filter_size=[3, 3], name='conv3b')
network = conv_layer(network, 256, filter_size=[3, 3], name='conv3c')
network = max_pool_layer(network, name='pool3')
net_pool3 = network

network = conv_layer(network, 512, filter_size=[3, 3], name='conv4a')
network = conv_layer(network, 512, filter_size=[3, 3], name='conv4b')
network = conv_layer(network, 512, filter_size=[3, 3], name='conv4c')
network = max_pool_layer(network, name='pool4')
net_pool4 = network

network = conv_layer(network, 512, filter_size=[3, 3], name='conv5a')
network = conv_layer(network, 512, filter_size=[3, 3], name='conv5b')
network = conv_layer(network, 512, filter_size=[3, 3], name='conv5c')
network = max_pool_layer(network, name='pool5')

network = deconv_layer(network, 256, filter_size=[3, 3], name='deconv1')
network = tf.concat([network, net_pool4], 3)
network = conv_layer(network, 256, filter_size=[5, 5], name='conv6')

network = deconv_layer(network, 128, filter_size=[3, 3], name='deconv2')
network = tf.concat([network, net_pool3], 3)
network = conv_layer(network, 128, filter_size=[5, 5], name='conv7')

# in the next lines I would have to change 2 into 3 to get 3 output classes
network = deconv_layer(network, 2, filter_size=[7, 7], strides=[8, 8], name='deconv3')
network = conv_layer(network, 2, filter_size=[7, 7], activation=' ', name='conv8')
y_ = tf.nn.softmax(network)

After computing, I generate an output image (in the test phase, after training is completed):

for i in range(rows):
    for j in range(cols):
        for k in range(layers):
            imdata[i*img_height:(i+1)*img_height, j*img_width:(j+1)*img_width, k] = cnn_output[cols*i+j, :, :, k]
imdata = imdata[0:im.height, 0:im.width]
for row in range(real_height):
    for col in range(real_width):
        if np.amax(imdata[row, col, :]) == imdata[row, col, 0]:
            imdata[row, col, :] = 255, 0, 0
        elif np.amax(imdata[row, col, :]) == imdata[row, col, 1]:
            imdata[row, col, :] = 0, 255, 0
        else:
            imdata[row, col, :] = 0, 0, 255
        # img[row][col] = imdata[row][col]
# Save the image
scipy.misc.imsave(out_file, imdata)
im.close()

imdata has the shape of my image with 3 layers (1080, 1920, 3).

Why do you have 2 output channels? And why would using 3 channels make any difference other than the hard-coded `2`s here and there? — ShlomiF, Dec 03 '18 at 21:04
As far as I understood, the number of output channels should match the number of classes I want to identify. Or am I mistaken? — Cal Blau, Dec 03 '18 at 21:15
In the case of two classes there's some redundancy there. But never mind. Why are you worried about the 3 channel case? It's the same, but with 3 channels and following softmax. — ShlomiF, Dec 03 '18 at 21:17
The main question is what the label image should look like (grayscale or RGB), or whether this doesn't matter at all. — Cal Blau, Dec 03 '18 at 21:19
the labels are just numbers. If you're using softmax then the labels should be binary; 1 for correct class, 0 for wrong class. Is that RGB or GS? I don't think that's defined... — ShlomiF, Dec 03 '18 at 21:21
Can't I directly identify 3 classes? Like labeling them with 1 for object1, 2 for object2 and 0 for background? Of course I could train 2 CNNs, first identify all objects, then identify object1 with the second one and get the object2s via the difference, but there has to be a smoother solution? — Cal Blau, Dec 03 '18 at 21:25
You can. But then you shouldn't have that softmax there at the end, which will squash things down to the 0-1 range. Rather just use the `l2` norm of the difference between the last three layers and the 3-channel labels. — ShlomiF, Dec 03 '18 at 21:28

score 3 · Accepted Answer · answered Dec 03 '18 at 23:05

If I understood your question right, you want to know what your label image should look like for a 3-class problem.

Let's see how it should be for a two-class problem first. The label-image would consist of just zeros and ones and you would use a binary cross-entropy loss for each pixel and then (maybe) average it over the whole image.
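To make the two-class case concrete, here is a minimal NumPy sketch of a per-pixel binary cross-entropy averaged over the image. It is not the TensorFlow code from the question; the function name `mean_pixel_bce`, the `eps` clipping value and the sample arrays are assumptions chosen for illustration.

```python
import numpy as np

def mean_pixel_bce(labels, probs, eps=1e-7):
    """Binary cross-entropy per pixel, averaged over the whole image.

    labels: (H, W) array of 0/1 ground-truth values.
    probs:  (H, W) array of predicted foreground probabilities.
    """
    # Clip so that log(0) can never occur.
    probs = np.clip(probs, eps, 1.0 - eps)
    per_pixel = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return per_pixel.mean()

# Hypothetical 2x2 label image and prediction.
labels = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
probs = np.array([[0.1, 0.9],
                  [0.8, 0.3]])
loss = mean_pixel_bce(labels, probs)
print(loss)
```

A prediction that matches the label image exactly gives a loss close to zero, and the loss grows as the predicted probabilities move away from the labels.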

For an n-class problem, your label image would have size H x W x n, where the slice along the depth axis at any pixel is a one-hot encoded vector: all zeros except for a single one at the index of that pixel's class.
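As a small sketch of that H x W x n layout, the snippet below turns a map of integer class ids into a one-hot label image with NumPy. The helper name `one_hot_labels` and the 2x2 example map are assumptions made up for this illustration.

```python
import numpy as np

def one_hot_labels(class_map, num_classes):
    """Turn an (H, W) map of integer class ids into an (H, W, num_classes) array."""
    # Row k of the identity matrix is the one-hot vector for class k,
    # so indexing the identity matrix with the class map encodes every pixel.
    return np.eye(num_classes, dtype=np.float32)[class_map]

# Hypothetical 2x2 image: background = 0, object1 = 1, object2 = 2.
class_map = np.array([[0, 1],
                      [2, 1]])
label_image = one_hot_labels(class_map, num_classes=3)
print(label_image.shape)
print(label_image[1, 0])
```

Every depth slice of `label_image` sums to one, which is what a softmax output is compared against.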

Both the images are taken from here. I encourage you to read that blog.

Once you predict your label-image, you could easily convert it by assigning specific colors to labels. For example, in a 2-class segmented image, label 0 => color 0 and label 1 => color 255 - that is a binary image.

For an n-class segmented image, you could take n equidistant points in the range [0, 0, 0] to [255, 255, 255] and then assign each of these colors to a label. Usually, you could choose such colors manually (e.g. red, green, blue, yellow for 4 classes) but if you want to get really fancy, you could use something like this.
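One way to read "n equidistant points" is as evenly spaced gray levels along the diagonal from black to white. The sketch below builds such a palette under that assumption; `equidistant_palette` and the sample label map are hypothetical names and data, not part of the original code.

```python
import numpy as np

def equidistant_palette(num_classes):
    """Return a (num_classes, 3) uint8 array of evenly spaced gray colors."""
    # Evenly spaced levels between 0 and 255, repeated for R, G and B.
    levels = np.linspace(0, 255, num_classes).round().astype(np.uint8)
    return np.stack([levels, levels, levels], axis=1)

palette = equidistant_palette(4)

# Hypothetical (H, W) map of predicted class ids.
label_map = np.array([[0, 3],
                      [1, 2]])
# Indexing the palette with the label map colors every pixel at once.
color_image = palette[label_map]
print(color_image.shape)
```

For a handful of classes, a hand-picked palette (red, green, blue, ...) is usually easier to read than gray levels; only the palette array changes, the lookup stays the same.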

Regic · Answer 2 · 2018-12-03T23:30:54.910

Classification labels are generally a vector where each element represents a class:

class A: [1, 0, 0]
class B: [0, 1, 0]
class C: [0, 0, 1]

The reason is that the output of your network is a softmax function, which produces a vector of values between 0 and 1. E.g. it can output [0.1, 0.1, 0.8]. The values always add up to 1, so using softmax assumes that every pixel in the picture can belong to only one class, since an increase in the network output for one class lowers the output for the other classes.
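The sum-to-one behaviour can be checked with a few lines of NumPy. This is a plain re-implementation for illustration, not `tf.nn.softmax` itself, and the logits are invented example values.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical raw network outputs for one pixel and three classes.
logits = np.array([1.0, 1.0, 3.0])
probs = softmax(logits)

# Raise only the score of class 0 and compare.
boosted = softmax(logits + np.array([2.0, 0.0, 0.0]))
print(probs, probs.sum())
print(boosted, boosted.sum())
```

Both outputs sum to 1, and raising the score for class 0 pulls the probabilities of classes 1 and 2 down.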

In segmentation, a class is assigned to every pixel, so your label placeholder now has size 3*img_size instead of 2*img_size:

# Placeholder for labels of input images (ground truth)
y = tf.placeholder(tf.float32, [None, 3*img_size])
# Arranging labels in 4D format
y_shaped = tf.reshape(y, [-1, img_size, 3])
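If the labels are stored as a grayscale image, they have to be converted before they fit this placeholder. The sketch below assumes the gray levels 0, 127 and 255 mentioned in the question stand for classes 0, 1 and 2; `GRAY_VALUES` and `gray_to_flat_one_hot` are hypothetical names introduced for the example.

```python
import numpy as np

# Assumed gray level for each class (class 0, class 1, class 2).
GRAY_VALUES = np.array([0, 127, 255])

def gray_to_flat_one_hot(gray):
    """Convert an (H, W) grayscale label image to a flat one-hot vector.

    The result has length 3 * H * W with the class axis innermost, which
    matches a later reshape to [-1, img_size, 3].
    """
    # Compare every pixel with every gray level: (H, W, 1) == (3,) -> (H, W, 3).
    one_hot = (gray[..., None] == GRAY_VALUES).astype(np.float32)
    return one_hot.reshape(-1)

# Hypothetical 2x2 label image.
gray = np.array([[0, 127],
                 [255, 127]], dtype=np.uint8)
flat = gray_to_flat_one_hot(gray)
per_pixel = flat.reshape(-1, 3)  # one one-hot row per pixel
print(flat.shape)
print(per_pixel)
```

A batch of such vectors stacked along the first axis would then be fed to `y`. Pixels whose gray value matches none of the assumed levels end up as all zeros, so it is worth checking that every row sums to one.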

For the output:

I assume cnn_output contains the output for only one picture, not for the whole batch.

You need to find out which class has the highest score. np.argmax can help here:

class_index = np.argmax(cnn_output, axis=2)

class_index now contains the class number with the highest score. (If cnn_output is only 2 dimensional, set axis to 1.) Next you need to map these values to colors:

colors = {0: [255, 0, 0], 1: [0, 255, 0], 2: [0, 0, 255]}
# np.nditer yields 0-d arrays, which cannot be used as dict keys, so cast to int
colored_image = np.array([colors[int(x)] for x in np.nditer(class_index)],
                         dtype=np.uint8)
output_image = np.reshape(colored_image, (img_height, img_width, 3))

First we create colored_image, which contains the color of each pixel but is still a flat array with one row per pixel, of shape (img_height*img_width, 3). You therefore have to convert it to a 3-dimensional array with np.reshape. You can now draw output_image:

plt.imshow(output_image)
plt.show()
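The same mapping can also be written without a Python loop, which matters for images as large as 1080 x 1920. This is a sketch of an alternative, assuming the scores are laid out as (height, width, classes); `PALETTE`, `colorize` and the sample scores are names and values invented for the example.

```python
import numpy as np

# One RGB color per class: class 0 red, class 1 green, class 2 blue.
PALETTE = np.array([[255, 0, 0],
                    [0, 255, 0],
                    [0, 0, 255]], dtype=np.uint8)

def colorize(scores):
    """Map (H, W, 3) class scores to an (H, W, 3) uint8 color image."""
    # Index of the highest-scoring class for every pixel, shape (H, W).
    class_index = np.argmax(scores, axis=-1)
    # Integer-array indexing looks up the palette row for every pixel at once.
    return PALETTE[class_index]

# Hypothetical softmax output for a 1x2 image.
scores = np.array([[[0.1, 0.1, 0.8],
                    [0.7, 0.2, 0.1]]])
output_image = colorize(scores)
print(output_image)
```

The result already has the image shape, so no separate reshape is needed before passing it to plt.imshow.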

I want every pixel to be only one class. I will edit my original question to give you some code. — Cal Blau, Dec 03 '18 at 22:05
