I'm using two LMDB inputs to identify the eye, nose-tip and mouth regions of a face. The data LMDB has dimensions Nx3xHxW, while the label LMDB has dimensions Nx1x(H/4)x(W/4). The label image is created by masking the regions with the numbers 1-4 on an OpenCV Mat that was initialized to all 0s (so in total there are 5 labels, with 0 being the background label). I scaled the label image down to 1/4 of the width and height of the corresponding image because my net has 2 pooling layers; this downscaling makes the label image dimensions match the output of the last convolution layer.
My train_val.prototxt:
name: "facial_keypoints"
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../train_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TRAIN
  }
  data_param {
    source: "../train_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TEST
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TEST
  }
  data_param {
    source: "../test_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "images"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.0001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "pool1"
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv_last"
  type: "Convolution"
  bottom: "pool2"
  top: "conv_last"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 5
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      #type: "xavier"
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv_last"
  top: "conv_last"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "conv_last"
  bottom: "labels"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "conv_last"
  bottom: "labels"
  top: "loss"
}
In the last convolution layer, I set the number of outputs to 5 because I have 5 label classes. Training converged, with the final loss at about 0.3 and accuracy at 0.9 (although some sources suggest this accuracy is not correctly measured for multi-label outputs). When using the trained model, the output layer correctly produces a blob of dimensions 1x5x(H/4)x(W/4), which I managed to visualize as 5 separate single-channel images. However, while the first image correctly highlighted the background pixels, the remaining 4 images look almost the same, with all 4 regions highlighted.
Visualization of the 5 output channels (intensity increases from blue to red):
Original image (the concentric circles mark the highest-intensity point from each channel; some are drawn bigger just to distinguish them from the others. As you can see, apart from the background, the remaining channels all have their highest activations on almost the same mouth region, which should not be the case.)
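For completeness, this is roughly how I reduce the network output for visualization; instead of inspecting the 5 heat maps separately, the channels can also be collapsed with an argmax into a single predicted label map (sketch with a dummy blob standing in for `net.blobs['conv_last'].data` from pycaffe):

```python
import numpy as np

# Dummy (1, 5, h, w) score blob; in practice this would come from a pycaffe
# forward pass, e.g. net.blobs['conv_last'].data after net.forward().
rng = np.random.default_rng(0)
scores = rng.standard_normal((1, 5, 32, 32)).astype(np.float32)

# Argmax over the channel axis gives a per-pixel class prediction (0-4),
# directly comparable to the downscaled ground-truth label image.
label_map = scores[0].argmax(axis=0)
print(label_map.shape)   # (32, 32)
```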
Could someone help me spot the mistake I made?
Thanks.