Attached at the end is the output of the Caffe training initialization.

My network consists of 4 convolutional layers followed by 3 fully connected layers.

I computed my filter and feature-map sizes, and my calculations show them to be consistent (a small calculator reproducing these numbers follows the list below).

  1. Input is an image of dimensions (195 X 65).
  2. The conv1 layer filters are (32 X 13 X 5), where 32 is the number of filters. Stride is 1 along both axes.
  3. Each output map of conv1 should be of size (183 X 61).
  4. A MAX pooling layer of size (2 X 2) with stride 2 follows.
  5. The output of this pooling layer should be of size (92 X 31).
  6. This is followed by the conv2 layer of size (48 X 7 X 3).
  7. Each feature map of the conv2 layer should be of size (86 X 29).
  8. This is followed by an average pooling layer with window size (2 X 2).
  9. The output of this pooling layer is of size (43 X 15).
  10. This is followed by the conv3 layer of size (64 X 7 X 3), where 64 is the number of filters.
  11. Each output map of conv3 should be of size (37 X 13).
  12. This is immediately followed by the conv4 layer of size (64 X 5 X 3).
  13. Each feature map of the conv4 output should be of size (33 X 11).
  14. conv4 is followed by an average pooling layer with window size (2 X 2).
  15. The output of this is of size (17 X 6).
  16. This is followed by 3 fully connected layers: fc6 and fc7 with 300 output neurons each, and fc8 with 2 (per the prototxt below).
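
Here is a minimal Python sketch that reproduces these numbers using Caffe's standard output-size formulas (convolution rounds down, pooling rounds up; the round-up is why pool1 yields (92 X 31) from (183 X 61) rather than (91 X 30)):

    import math

    def conv_out(size, kernel, stride=1, pad=0):
        # Caffe convolution output: floor((size + 2*pad - kernel) / stride) + 1
        return (size + 2 * pad - kernel) // stride + 1

    def pool_out(size, kernel, stride, pad=0):
        # Caffe pooling output rounds UP: ceil((size + 2*pad - kernel) / stride) + 1
        return int(math.ceil((size + 2.0 * pad - kernel) / stride)) + 1

    h, w = 195, 65                                   # input image
    h, w = conv_out(h, 13), conv_out(w, 5)           # conv1 -> 183 x 61
    h, w = pool_out(h, 2, 2), pool_out(w, 2, 2)      # pool1 -> 92 x 31
    h, w = conv_out(h, 7), conv_out(w, 3)            # conv2 -> 86 x 29
    h, w = pool_out(h, 2, 2), pool_out(w, 2, 2)      # pool2 -> 43 x 15
    h, w = conv_out(h, 7), conv_out(w, 3)            # conv3 -> 37 x 13
    h, w = conv_out(h, 5), conv_out(w, 3)            # conv4 -> 33 x 11
    h, w = pool_out(h, 2, 2), pool_out(w, 2, 2)      # pool4 -> 17 x 6
    print(h, w)                                      # 17 6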

No padding is applied at any layer. Below is the output of the Caffe training initialization. It shows some errors at conv2, conv3, and then at fc6. I am not able to understand the source of those errors. Could someone please clarify?

I1119 17:55:45.301447 14339 caffe.cpp:156] Using GPUs 0
I1119 17:55:45.447502 14339 solver.cpp:33] Initializing solver from parameters:
test_iter: 970
test_interval: 2908
base_lr: 0.01
display: 50
max_iter: 87240
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 28790
snapshot: 2908
snapshot_prefix: "snapshot"
solver_mode: GPU
device_id: 0
net: "train_val.prototxt"
solver_type: SGD
I1119 17:55:45.447551 14339 solver.cpp:81] Creating training net from net file: train_val.prototxt
I1119 17:55:45.448559 14339 net.cpp:316] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I1119 17:55:45.448586 14339 net.cpp:316] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I1119 17:55:45.448849 14339 net.cpp:47] Initializing net from parameters:
state {
phase: TRAIN
}
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
mean_file: "/home/uujjwal/libraries/digits/DIGITS/digits/jobs/20151118-160932-e0c7/mean.binaryproto"
crop_h: 195
crop_w: 65
}
data_param {
source: "/home/uujjwal/libraries/digits/DIGITS/digits/jobs/20151118-160932-e0c7/train_db"
batch_size: 100
backend: LMDB
prefetch: 4
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 32
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
kernel_h: 13
kernel_w: 5
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "conv1"
top: "conv1"
}
layer {
name: "norm1"
type: "LRN"
bottom: "conv1"
top: "norm1"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "norm1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
pad: 0
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 48
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0.1
}
kernel_h: 7
kernel_w: 3
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "norm2"
type: "LRN"
bottom: "conv2"
top: "norm2"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "norm2"
top: "pool2"
pooling_param {
pool: AVE
kernel_size: 2
stride: 2
pad: 0
}
}
layer {
name: "conv3"
type: "Convolution"
bottom: "pool2"
top: "conv3"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 64
pad: 0
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
kernel_h: 7
kernel_w: 3
}
}
layer {
name: "relu3"
type: "ReLU"
bottom: "conv3"
top: "conv3"
}
layer {
name: "conv4"
type: "Convolution"
bottom: "conv3"
top: "conv4"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 64
pad: 0
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
kernel_h: 5
kernel_w: 3
}
}
layer {
name: "relu4"
type: "ReLU"
bottom: "conv4"
top: "conv4"
}
layer {
name: "norm4"
type: "LRN"
bottom: "conv4"
top: "norm4"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layer {
name: "pool4"
type: "Pooling"
bottom: "norm4"
top: "pool4"
pooling_param {
pool: AVE
kernel_size: 2
stride: 2
pad: 0
}
}
layer {
name: "fc6"
type: "InnerProduct"
bottom: "pool4"
top: "fc6"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 300
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 0.1
}
}
}
layer {
name: "relu6"
type: "ReLU"
bottom: "fc6"
top: "fc6"
}
layer {
name: "drop6"
type: "Dropout"
bottom: "fc6"
top: "fc6"
dropout_param {
dropout_ratio: 0.5
}
}
layer {
name: "fc7"
type: "InnerProduct"
bottom: "fc6"
top: "fc7"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 300
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 0.1
}
}
}
layer {
name: "relu7"
type: "ReLU"
bottom: "fc7"
top: "fc7"
}
layer {
name: "drop7"
type: "Dropout"
bottom: "fc7"
top: "fc7"
dropout_param {
dropout_ratio: 0.5
}
}
layer {
name: "fc8_retrain_retrain_retrain"
type: "InnerProduct"
bottom: "fc7"
top: "fc8"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "fc8"
bottom: "label"
top: "loss"
}
I1119 17:55:45.448966 14339 layer_factory.hpp:75] Creating layer data
I1119 17:55:45.449409 14339 net.cpp:99] Creating Layer data
I1119 17:55:45.449422 14339 net.cpp:409] data -> data
I1119 17:55:45.449450 14339 net.cpp:409] data -> label
I1119 17:55:45.449457 14339 net.cpp:131] Setting up data
I1119 17:55:45.453604 14349 db.cpp:34] Opened lmdb /home/uujjwal/libraries/digits/DIGITS/digits/jobs/20151118-160932-e0c7/train_db
I1119 17:55:45.454463 14339 data_layer.cpp:37] Decoding Datum
I1119 17:55:45.454485 14339 data_layer.cpp:65] output data size: 100,1,195,65
I1119 17:55:45.454495 14339 data_transformer.cpp:28] Loading mean file from: /home/uujjwal/libraries/digits/DIGITS/digits/jobs/20151118-160932-e0c7/mean.binaryproto
I1119 17:55:45.466743 14339 net.cpp:140] Top shape: 100 1 195 65 (1267500)
I1119 17:55:45.466761 14339 net.cpp:140] Top shape: 100 (100)
I1119 17:55:45.466770 14339 layer_factory.hpp:75] Creating layer conv1
I1119 17:55:45.466789 14339 net.cpp:99] Creating Layer conv1
I1119 17:55:45.466796 14339 net.cpp:453] conv1 <- data
I1119 17:55:45.466811 14339 net.cpp:409] conv1 -> conv1
I1119 17:55:45.466825 14339 net.cpp:131] Setting up conv1
I1119 17:55:45.468318 14339 net.cpp:140] Top shape: 100 32 183 61 (35721600)
I1119 17:55:45.468335 14339 layer_factory.hpp:75] Creating layer relu1
I1119 17:55:45.468343 14339 net.cpp:99] Creating Layer relu1
I1119 17:55:45.468348 14339 net.cpp:453] relu1 <- conv1
I1119 17:55:45.468354 14339 net.cpp:396] relu1 -> conv1 (in-place)
I1119 17:55:45.468360 14339 net.cpp:131] Setting up relu1
I1119 17:55:45.468371 14339 net.cpp:140] Top shape: 100 32 183 61 (35721600)
I1119 17:55:45.468375 14339 layer_factory.hpp:75] Creating layer norm1
I1119 17:55:45.468382 14339 net.cpp:99] Creating Layer norm1
I1119 17:55:45.468386 14339 net.cpp:453] norm1 <- conv1
I1119 17:55:45.468391 14339 net.cpp:409] norm1 -> norm1
I1119 17:55:45.468399 14339 net.cpp:131] Setting up norm1
I1119 17:55:45.468408 14339 net.cpp:140] Top shape: 100 32 183 61 (35721600)
I1119 17:55:45.468412 14339 layer_factory.hpp:75] Creating layer pool1
I1119 17:55:45.468422 14339 net.cpp:99] Creating Layer pool1
I1119 17:55:45.468426 14339 net.cpp:453] pool1 <- norm1
I1119 17:55:45.468431 14339 net.cpp:409] pool1 -> pool1
I1119 17:55:45.468438 14339 net.cpp:131] Setting up pool1
I1119 17:55:45.468456 14339 net.cpp:140] Top shape: 100 32 92 31 (9126400)
I1119 17:55:45.468461 14339 layer_factory.hpp:75] Creating layer conv2
I1119 17:55:45.468468 14339 net.cpp:99] Creating Layer conv2
I1119 17:55:45.468472 14339 net.cpp:453] conv2 <- pool1
I1119 17:55:45.468478 14339 net.cpp:409] conv2 -> conv2
I1119 17:55:45.468509 14339 net.cpp:131] Setting up conv2
OpenCV Error: Assertion failed (0 <= roi.x && 0 <= roi.width && roi.x + roi.width <= m.cols && 0 <= roi.y && 0 <= roi.height && roi.y + roi.height <= m.rows) in Mat, file /home/uujjwal/libraries/opencv-3.0.0/modules/core/src/matrix.cpp, line 495
terminate called after throwing an instance of 'cv::Exception'
what():  /home/uujjwal/libraries/opencv-3.0.0/modules/core/src/matrix.cpp:495: error: (-215) 0 <= roi.x && 0 <= roi.width && roi.x + roi.width <= m.cols && 0 <= roi.y && 0 <= roi.height && roi.y + roi.height <= m.rows in function Mat
*** Aborted at 1447952145 (unix time) try "date -d @1447952145" if you are using GNU date ***
PC: @     0x7f9446ec75d7 __GI_raise
*** SIGABRT (@0x9f71700003803) received by PID 14339 (TID 0x7f9430a6f700) from PID 14339; stack trace: ***
I1119 17:55:45.470507 14339 net.cpp:140] Top shape: 100 48 86 29 (11971200)
I1119 17:55:45.470520 14339 layer_factory.hpp:75] Creating layer relu2
I1119 17:55:45.470535 14339 net.cpp:99] Creating Layer relu2
I1119 17:55:45.470541 14339 net.cpp:453] relu2 <- conv2
I1119 17:55:45.470547 14339 net.cpp:396] relu2 -> conv2 (in-place)
I1119 17:55:45.470553 14339 net.cpp:131] Setting up relu2
I1119 17:55:45.470559 14339 net.cpp:140] Top shape: 100 48 86 29 (11971200)
I1119 17:55:45.470563 14339 layer_factory.hpp:75] Creating layer norm2
I1119 17:55:45.470569 14339 net.cpp:99] Creating Layer norm2
I1119 17:55:45.470573 14339 net.cpp:453] norm2 <- conv2
I1119 17:55:45.470580 14339 net.cpp:409] norm2 -> norm2
I1119 17:55:45.470587 14339 net.cpp:131] Setting up norm2
I1119 17:55:45.470593 14339 net.cpp:140] Top shape: 100 48 86 29 (11971200)
I1119 17:55:45.470598 14339 layer_factory.hpp:75] Creating layer pool2
I1119 17:55:45.470604 14339 net.cpp:99] Creating Layer pool2
I1119 17:55:45.470614 14339 net.cpp:453] pool2 <- norm2
I1119 17:55:45.470619 14339 net.cpp:409] pool2 -> pool2
I1119 17:55:45.470625 14339 net.cpp:131] Setting up pool2
I1119 17:55:45.470633 14339 net.cpp:140] Top shape: 100 48 43 15 (3096000)
I1119 17:55:45.470636 14339 layer_factory.hpp:75] Creating layer conv3
I1119 17:55:45.470643 14339 net.cpp:99] Creating Layer conv3
I1119 17:55:45.470648 14339 net.cpp:453] conv3 <- pool2
I1119 17:55:45.470654 14339 net.cpp:409] conv3 -> conv3
I1119 17:55:45.470661 14339 net.cpp:131] Setting up conv3
@     0x7f945ba27130 (unknown)
@     0x7f9446ec75d7 __GI_raise
@     0x7f9446ec8cc8 __GI_abort
@     0x7f94477cb9b5 (unknown)
@     0x7f94477c9926 (unknown)
@     0x7f94477c9953 (unknown)
@     0x7f94477c9b73 (unknown)
I1119 17:55:45.473515 14339 net.cpp:140] Top shape: 100 64 37 13 (3078400)
I1119 17:55:45.473528 14339 layer_factory.hpp:75] Creating layer relu3
I1119 17:55:45.473542 14339 net.cpp:99] Creating Layer relu3
I1119 17:55:45.473546 14339 net.cpp:453] relu3 <- conv3
I1119 17:55:45.473552 14339 net.cpp:396] relu3 -> conv3 (in-place)
I1119 17:55:45.473558 14339 net.cpp:131] Setting up relu3
I1119 17:55:45.473565 14339 net.cpp:140] Top shape: 100 64 37 13 (3078400)
I1119 17:55:45.473569 14339 layer_factory.hpp:75] Creating layer conv4
I1119 17:55:45.473577 14339 net.cpp:99] Creating Layer conv4
I1119 17:55:45.473580 14339 net.cpp:453] conv4 <- conv3
I1119 17:55:45.473587 14339 net.cpp:409] conv4 -> conv4
I1119 17:55:45.473592 14339 net.cpp:131] Setting up conv4
@     0x7f945269b58d cv::error()
@     0x7f945269b6f0 cv::error()
I1119 17:55:45.475775 14339 net.cpp:140] Top shape: 100 64 33 11 (2323200)
I1119 17:55:45.475786 14339 layer_factory.hpp:75] Creating layer relu4
I1119 17:55:45.475793 14339 net.cpp:99] Creating Layer relu4
I1119 17:55:45.475797 14339 net.cpp:453] relu4 <- conv4
I1119 17:55:45.475803 14339 net.cpp:396] relu4 -> conv4 (in-place)
I1119 17:55:45.475810 14339 net.cpp:131] Setting up relu4
I1119 17:55:45.475816 14339 net.cpp:140] Top shape: 100 64 33 11 (2323200)
I1119 17:55:45.475819 14339 layer_factory.hpp:75] Creating layer norm4
I1119 17:55:45.475826 14339 net.cpp:99] Creating Layer norm4
I1119 17:55:45.475829 14339 net.cpp:453] norm4 <- conv4
I1119 17:55:45.475834 14339 net.cpp:409] norm4 -> norm4
I1119 17:55:45.475841 14339 net.cpp:131] Setting up norm4
I1119 17:55:45.475847 14339 net.cpp:140] Top shape: 100 64 33 11 (2323200)
I1119 17:55:45.475852 14339 layer_factory.hpp:75] Creating layer pool4
I1119 17:55:45.475858 14339 net.cpp:99] Creating Layer pool4
I1119 17:55:45.475862 14339 net.cpp:453] pool4 <- norm4
I1119 17:55:45.475867 14339 net.cpp:409] pool4 -> pool4
I1119 17:55:45.475873 14339 net.cpp:131] Setting up pool4
I1119 17:55:45.475880 14339 net.cpp:140] Top shape: 100 64 17 6 (652800)
I1119 17:55:45.475884 14339 layer_factory.hpp:75] Creating layer fc6
I1119 17:55:45.475894 14339 net.cpp:99] Creating Layer fc6
I1119 17:55:45.475899 14339 net.cpp:453] fc6 <- pool4
I1119 17:55:45.475905 14339 net.cpp:409] fc6 -> fc6
I1119 17:55:45.475914 14339 net.cpp:131] Setting up fc6
@     0x7f945253219a cv::Mat::Mat()
@     0x7f945c14a3ea caffe::DataTransformer<>::Transform()
@     0x7f945c1cb478 caffe::DataLayer<>::load_batch()
@     0x7f945c1b0f98 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@     0x7f945c144d64 caffe::InternalThread::entry()
@     0x7f945bc4124a (unknown)
@     0x7f945ba1fdf5 start_thread
@     0x7f9446f881ad __clone
  • Looks like the errors are not coming from the same thread that reads the model, so the model itself (I guess) is OK. My guess is that a thread reading your lmdb has failed. Can you verify that the lmdb is not corrupted and was built correctly? – Shai Nov 19 '15 at 17:23
  • @Shai: I don't think there is a problem with the LMDB. I have not checked it deeply, but it is troubling that in the messages posted in my question there seem to be errors during the setup of conv2 and conv4. They should not be there, because the sizes of my feature maps seem okay. I am also puzzled that despite those errors the network initialization proceeds further and does not stop right when conv2 is being set up. I think I am missing something obvious, but I am not able to figure it out. – Ujjwal Aryan Nov 19 '15 at 18:46
  • Since the init process did not stop, I suspect the error came from a different thread. Also, the error message does not seem related to net init: these are OpenCV errors. I suspect a **single** error from another thread; this is why the error message and stack trace are printed in the middle of the net-init log. – Shai Nov 19 '15 at 18:54
  • I did modify Caffe to support a rectangular crop_size. However, the lmdb part remained the same. Does the LMDB have to store the images with the same dimensions as crop_size? I am not sure, but intuition tells me no. – Ujjwal Aryan Nov 20 '15 at 07:35
  • How did you prepare the lmdb? – Shai Nov 20 '15 at 07:49
  • Can you create a small dummy lmdb using the [`convert_imageset`](https://github.com/BVLC/caffe/blob/master/tools/convert_imageset.cpp) tool? – Shai Nov 20 '15 at 08:11
  • I prepared the LMDB dataset using the DIGITS tool. It got created without any trouble at all. – Ujjwal Aryan Nov 20 '15 at 09:20
  • So the error came from the lmdb? And now it is solved? – Shai Nov 20 '15 at 09:24
  • No. I mean the LMDB was created using DIGITS, but afterwards the error continues. – Ujjwal Aryan Nov 20 '15 at 09:26
  • It seems to me that some of the input images' dimensions are smaller than the size of the first input layer (crop_h: 195, crop_w: 65). – Radek Svoboda Jun 02 '17 at 14:20

2 Answers


Look at the stack trace of the error you got (I put it together from the different places it appeared in your log):

OpenCV Error: Assertion failed (0 <= roi.x && 0 <= roi.width && roi.x + roi.width <= m.cols && 0 <= roi.y && 0 <= roi.height && roi.y + roi.height <= m.rows) in Mat, file /home/uujjwal/libraries/opencv-3.0.0/modules/core/src/matrix.cpp, line 495
terminate called after throwing an instance of 'cv::Exception'
what():  /home/uujjwal/libraries/opencv-3.0.0/modules/core/src/matrix.cpp:495: error: (-215) 0 <= roi.x && 0 <= roi.width && roi.x + roi.width <= m.cols && 0 <= roi.y && 0 <= roi.height && roi.y + roi.height <= m.rows in function Mat
*** Aborted at 1447952145 (unix time) try "date -d @1447952145" if you are using GNU date ***
PC: @     0x7f9446ec75d7 __GI_raise
*** SIGABRT (@0x9f71700003803) received by PID 14339 (TID 0x7f9430a6f700) from PID 14339; stack trace: ***
@     0x7f945ba27130 (unknown)
@     0x7f9446ec75d7 __GI_raise
@     0x7f9446ec8cc8 __GI_abort
@     0x7f94477cb9b5 (unknown)
@     0x7f94477c9926 (unknown)
@     0x7f94477c9953 (unknown)
@     0x7f94477c9b73 (unknown)
@     0x7f945269b58d cv::error()
@     0x7f945269b6f0 cv::error()
@     0x7f945253219a cv::Mat::Mat()
@     0x7f945c14a3ea caffe::DataTransformer<>::Transform()
@     0x7f945c1cb478 caffe::DataLayer<>::load_batch()
@     0x7f945c1b0f98 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@     0x7f945c144d64 caffe::InternalThread::entry()
@     0x7f945bc4124a (unknown)
@     0x7f945ba1fdf5 start_thread
@     0x7f9446f881ad __clone

You can clearly see that:

  1. This is a single error, and not multiple ones.
  2. It occurred in a thread that is not the main thread (and this is why its error message and trace are printed between the output lines of the main thread building the model).
  3. This error happens when the pre-fetching thread in the data layer tries to transform one of the input images.

What can you do?

  • You can try changing the prefetch: 4 parameter in your input data layer.
  • You can test with a different input lmdb, constructed anew, in which you verify that all images in the dataset are indeed valid (a small validation sketch follows this list).
  • You can reconstruct the input lmdb using the convert_imageset utility, rather than trusting DIGITS to do the job for you.
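
For the second point, here is a minimal Python sketch (assuming the lmdb package and pycaffe's caffe_pb2 are importable; the LMDB path and crop values are taken from the log above). It scans the database for any image smaller than the crop size, which is exactly the condition that would make the DataTransformer's OpenCV ROI assertion fail:

    import lmdb
    from caffe.proto import caffe_pb2

    CROP_H, CROP_W = 195, 65  # crop_h / crop_w from the transform_param above

    env = lmdb.open('train_db', readonly=True)  # path to the training LMDB
    with env.begin() as txn:
        for key, value in txn.cursor():
            datum = caffe_pb2.Datum()
            datum.ParseFromString(value)
            # any datum smaller than the crop triggers the ROI assertion
            if datum.height < CROP_H or datum.width < CROP_W:
                print(key, datum.height, datum.width)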
Shai
  • I will try this ASAP. What is the role of the prefetch parameter in this? – Ujjwal Aryan Nov 24 '15 at 07:33
  • @UjjwalAryan the data layer uses threading to prepare the data in advance, fetching it from disk before it is needed so that training will not be delayed by disk access time. – Shai Nov 24 '15 at 07:36

You can read the tutorial on the following page; it explains how the InnerProduct layer is computed: http://caffe.berkeleyvision.org/tutorial/layers.html. You can check the dimensions of your fc6 layer, or debug the code to trace the relevant variables.
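
As a sanity check on those dimensions, here is a minimal NumPy sketch of the math an InnerProduct layer performs (an illustration only, not Caffe's actual code), using the pool4 shape from the log (100 x 64 x 17 x 6):

    import numpy as np

    bottom = np.random.randn(100, 64, 17, 6)     # pool4 output from the log
    flat = bottom.reshape(100, -1)               # 100 x 6528 (64 * 17 * 6)
    W = np.random.randn(300, flat.shape[1])      # fc6 weights: num_output x 6528
    b = np.zeros(300)                            # fc6 biases
    top = flat.dot(W.T) + b                      # InnerProduct: y = x W^T + b
    print(top.shape)                             # (100, 300), matching num_output: 300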

fei
  • If you read the error message closely, you see that it has nothing to do with the layers or the InnerProduct dimensions. – Shai Nov 24 '15 at 07:35