I'm using two LMDB inputs to identify the eye, nose-tip and mouth regions of a face. The data LMDB has dimensions Nx3xHxW, while the label LMDB has dimensions Nx1x(H/4)x(W/4). The label image is created by masking the regions with the numbers 1-4 on an OpenCV Mat that was initialized to all 0s (so in total there are 5 labels, with 0 being the background label). I scaled the label image down to 1/4 of the width and height of the corresponding image because my net has 2 pooling layers; this downscaling makes the label image dimensions match the output of the last convolution layer.
My train_val.prototxt:
name: "facial_keypoints"
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../train_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TRAIN
  }
  data_param {
    source: "../train_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TEST
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TEST
  }
  data_param {
    source: "../test_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "images"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.0001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "pool1"
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv_last"
  type: "Convolution"
  bottom: "pool2"
  top: "conv_last"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 5
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      #type: "xavier"
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv_last"
  top: "conv_last"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "conv_last"
  bottom: "labels"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "conv_last"
  bottom: "labels"
  top: "loss"
}
In the last convolution layer, I set the number of outputs to 5 because I have 5 label classes. Training converged, with the final loss at about 0.3 and accuracy at 0.9 (although some sources suggest this accuracy is not correctly measured for multi-label outputs). When using the trained model, the output layer correctly produces a blob of dimensions 1x5x(H/4)x(W/4), which I managed to visualize as 5 separate single-channel images. However, while the first image correctly highlighted the background pixels, the remaining 4 images look almost the same, with all 4 regions highlighted.
Visualization of the 5 output channels (intensity increases from blue to red):
Original image (the concentric circles mark the highest-intensity point from each channel; some are drawn bigger just to distinguish them from the others. As you can see, apart from the background, the remaining channels all have their highest activations on almost the same mouth region, which should not be the case.)
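For completeness, this is roughly how I reduce the network output for visualization; instead of inspecting the 5 heat maps separately, the channels can also be collapsed with an argmax into a single predicted label map (sketch with a dummy blob standing in for `net.blobs['conv_last'].data` from pycaffe):

```python
import numpy as np

# Dummy (1, 5, h, w) score blob; in practice this would come from a pycaffe
# forward pass, e.g. net.blobs['conv_last'].data after net.forward().
rng = np.random.default_rng(0)
scores = rng.standard_normal((1, 5, 32, 32)).astype(np.float32)

# Argmax over the channel axis gives a per-pixel class prediction (0-4),
# directly comparable to the downscaled ground-truth label image.
label_map = scores[0].argmax(axis=0)
print(label_map.shape)   # (32, 32)
```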
Could someone help me spot the mistake I made?
Thanks.