
I'm facing severe power cuts in my hometown and have had to restart my training multiple times. Any suggestions on how I can resume training from the last iteration point? I am using Caffe and LMDB files. Thanks in advance.

Ryan

1 Answer


Caffe can save a "snapshot" periodically during training. You can resume training from the last snapshot you have simply by running:

$CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -snapshot /path/to/latest.solverstate
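If several snapshots have accumulated, a small shell sketch can pick the newest `.solverstate` automatically. The paths here are illustrative (they assume a `snapshot_prefix` of `/path/to/snaps`), and the selection relies on `ls -t` sorting by modification time:

```shell
# Illustrative paths: adjust SNAP_PREFIX to match your snapshot_prefix.
SNAP_PREFIX=/path/to/snaps

# ls -t sorts by modification time, newest first.
LATEST=$(ls -t "${SNAP_PREFIX}"*.solverstate | head -n 1)

$CAFFE_ROOT/build/tools/caffe train \
    -solver /path/to/solver.prototxt \
    -snapshot "$LATEST"
```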

In your solver.prototxt you can define how often a snapshot is taken by setting

snapshot: 2500  # take a snapshot every 2500 iterations

Snapshot files are saved to the location defined by

snapshot_prefix: "/path/to/snaps"

There you will find both a .solverstate and a .caffemodel file saved every 2500 iterations.
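Putting these pieces together, a minimal solver.prototxt fragment for periodic snapshotting might look like the sketch below. The paths and values are illustrative, not taken from the original post; Caffe names the files by appending `_iter_<N>` to the prefix:

```
net: "/path/to/train_val.prototxt"  # network definition (illustrative path)
snapshot: 2500                      # write a snapshot every 2500 iterations
snapshot_prefix: "/path/to/snaps"   # yields snaps_iter_2500.solverstate, snaps_iter_2500.caffemodel, ...
```

Resuming then only requires pointing `caffe train` at the newest .solverstate file.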

Shai
  • Thanks for your time Shai, but I can't seem to find a .snapshot file; I do have a .solverstate and a .caffemodel file. – Ryan Aug 13 '17 at 08:18
  • @Ryan my bad, it's ".solverstate" and not ".snapshot". Please see my edit. – Shai Aug 13 '17 at 08:21
  • I'm getting: `Cannot copy param 0 weights from layer 'conv10'; shape mismatch. Source param shape is 7 512 1 1 (3584); target param shape is 6 512 1 1 (3072). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.` – Ryan Aug 13 '17 at 08:27
  • What layer am I supposed to rename? – Ryan Aug 13 '17 at 08:29
  • @Ryan you are not supposed to change the net structure between snapshots. – Shai Aug 13 '17 at 08:30
  • I haven't changed anything yet. – Ryan Aug 13 '17 at 08:31
  • @Ryan it seems, from the error message you got, like you did make a change to the net (in layer `conv10`). – Shai Aug 13 '17 at 08:33
  • Maybe I messed up somewhere; thanks for the help! Will let you know if problems occur when I resume training some other time. – Ryan Aug 13 '17 at 08:38
  • I ran 100 iterations just to check if I got it right; when I cancel the training and resume it with the 100_iter file, I get a cudaSuccess (2 vs. 0) out of memory error. I'm not running any other model. – Ryan Aug 13 '17 at 09:27
  • Try `nvidia-smi` to see what else is on your GPU. It might be the case that the other process did not die and is just suspended. – Shai Aug 13 '17 at 09:28
  • I checked your related answer to the above question (https://stackoverflow.com/questions/33790366/caffe-check-failed-error-cudasuccess-2-vs-0-out-of-memory); if I change the batch size at the moment, I'm worried that I will be changing the net. – Ryan Aug 13 '17 at 09:30
  • @Ryan changing the batch size should not affect your layer structure. It's a serious bug if it does. – Shai Aug 13 '17 at 09:31
  • 1
    I changed the batch size from 16 to 8- its working fine,thanks for the help – Ryan Aug 13 '17 at 09:43
  • Kindly suggest something for this issue (https://stackoverflow.com/questions/45659270/trouble-training-caffe-model-cudnn-status-internal-error). – Ryan Aug 13 '17 at 10:15
  • I added a snapshot at iteration 1022 but it just gets ignored and training starts from iteration 0. – TSR Sep 16 '18 at 18:56