18

So I need to retrain Tiny YOLO using my own dataset. The model I am using can be found here: keras-yolo3 .

I started training and I get multiple optimizer errors, added the code of the errors to stop confusion. And I noticed the training is going slow even tho it should use the GPU, and after digging a bit I found that this is not using the GPU for training. I should note that on another smaller network which I used for learning training uses GPU so everything is set correctly from that side, and they are no errors of this type when I did that training.

Is this slow and somewhat CPU training because of said errors? How can I fix this does anyone know?

Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
2019-08-19 09:45:08.057713: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-08-19 09:45:08.264577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.8475
pciBusID: 0000:01:00.0
2019-08-19 09:45:08.270723: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-08-19 09:45:08.275827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-19 09:45:09.214197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-19 09:45:09.217605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-19 09:45:09.219777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-19 09:45:09.222399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4712 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Create Tiny YOLOv3 model with 6 anchors and 80 classes.
Load weights model_data/tiny_yolo_weights.h5.
Freeze the first 42 layers of total 44 layers.
Train on 8298 samples, val on 922 samples, with batch size 32.
Epoch 1/50
2019-08-19 09:45:19.742610: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:19.781035: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:19.935930: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:20.168936: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:20.205304: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
258/259 [============================>.] - ETA: 3s - loss: 41.82962019-08-19 10:01:51.053474: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 10:01:51.138957: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 10:01:51.243888: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
259/259 [==============================] - 1078s 4s/step - loss: 41.8008 - val_loss: 35.7122
UrosT
  • 181
  • 1
  • 6
  • Which version of tensorflow are you using ? – ravikt Nov 08 '19 at 05:47
  • While training, what does 'nvidia-smi' command shows? – Ashwin Geet D'Sa Nov 08 '19 at 11:34
  • @ravikt I was using 1.14.0 version of tensorflow (the version that was stable at the time). – UrosT Nov 09 '19 at 08:25
  • @AshwinGeetD'Sa unfortunately cause of some problem with the PC I was doing the training on I can't currently start the training and use said command, as you asked. – UrosT Nov 09 '19 at 08:25
  • what is telling you that no GPU is used for training? When looking at the logs, it seems that GPU is actually used. Someone seems to have the same problem as you here, and found a hacky solution: https://github.com/qqwweee/keras-yolo3/issues/548#issuecomment-547715335 – Zaccharie Ramzi Nov 09 '19 at 17:45
  • You don't have to restart the training. You can just use it in other terminal. – Ashwin Geet D'Sa Nov 10 '19 at 09:47

1 Answers1

22

I have found solution here: https://github.com/tensorflow/tensorrt/issues/118

You have to change lines(140/141) in yolo3/model.py:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))

to:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[...,::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[...,::-1], K.dtype(feats))

Also in my case helps decrease batch size from 8 to 4.

Piotr Golinski
  • 990
  • 8
  • 19
  • 7
    For any one like me who is trying hard to figure out what is the difference,the difference is K.cast(grid_shape[::-1] has been changed to K.cast(grid_shape[...,::-1] similarly input_shape has been changed in the second line – abacusreader Mar 04 '20 at 05:32
  • @piotr-golinski Thanks! Where do you change the batch size from 8 to 4? – relizt May 24 '20 at 23:04