I'm facing severe power cuts in my hometown and have had to restart my training multiple times. Any suggestions on how I can resume training from my last iteration? I am using Caffe and LMDB files. Thanks in advance.
1 Answer
Caffe can save a "snapshot" every once in a while. You can resume your training from the last snapshot you have by simply running:
$CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -snapshot /path/to/latest.solverstate
In your solver.prototxt you can define how often a snapshot is taken by setting, for example:
snapshot: 2500 # take a snapshot every 2500 iterations
The snapshot files are saved to the location defined by
snapshot_prefix: "/path/to/snaps"
There you can find both a .solverstate and a .caffemodel file saved every 2500 iterations.
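For reference, a minimal solver.prototxt sketch with these snapshot settings might look like the following (the net path, learning-rate values and iteration counts are placeholders, not taken from the question):
net: "/path/to/train_val.prototxt"
base_lr: 0.01
lr_policy: "step"
stepsize: 10000
max_iter: 100000
snapshot: 2500                      # write a snapshot every 2500 iterations
snapshot_prefix: "/path/to/snaps"   # produces snaps_iter_2500.solverstate, snaps_iter_2500.caffemodel, ...
solver_mode: GPU
Caffe names the files <snapshot_prefix>_iter_<N>.solverstate and <snapshot_prefix>_iter_<N>.caffemodel, so resuming from the 2500-iteration snapshot in this sketch would be:
$CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -snapshot /path/to/snaps_iter_2500.solverstate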

– Shai
- Thanks for your time Shai, but I can't seem to find a .snapshot file; I do have a .solverstate and a .caffemodel file. – Ryan Aug 13 '17 at 08:18
- @Ryan my bad, it's ".solverstate" and not ".snapshot"; please see my edit. – Shai Aug 13 '17 at 08:21
- I'm getting: `Cannot copy param 0 weights from layer 'conv10'; shape mismatch. Source param shape is 7 512 1 1 (3584); target param shape is 6 512 1 1 (3072). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.` – Ryan Aug 13 '17 at 08:27
- What layer am I supposed to rename? – Ryan Aug 13 '17 at 08:29
- @Ryan you are not supposed to change the net structure between snapshots. – Shai Aug 13 '17 at 08:30
- I haven't changed anything yet. – Ryan Aug 13 '17 at 08:31
- @Ryan it seems, from the error message you got, like you did make a change to the net (in layer `conv10`). – Shai Aug 13 '17 at 08:33
- Maybe I messed up somewhere; thanks for the help! I'll let you know if problems occur when I resume training some other time. – Ryan Aug 13 '17 at 08:38
- I ran 100 iterations just to check if I got it right. When I cancel the training and resume it with the 100_iter file, I get a CudaSuccess (2 vs 0) out of memory error, and I'm not running any other model. – Ryan Aug 13 '17 at 09:27
- Try `nvidia-smi` to see what else is on your GPU. It might be the case that the other process did not die and is just suspended. – Shai Aug 13 '17 at 09:28
- I checked your related answer to the above question (https://stackoverflow.com/questions/33790366/caffe-check-failed-error-cudasuccess-2-vs-0-out-of-memory). If I change the batch size at the moment, I'm worried that I will be changing the net. – Ryan Aug 13 '17 at 09:30
- @Ryan changing batch size should not affect your layer structure. It's a serious bug if it does. – Shai Aug 13 '17 at 09:31
- I changed the batch size from 16 to 8 and it's working fine, thanks for the help (see the batch-size sketch after the comments). – Ryan Aug 13 '17 at 09:43
- Kindly suggest something for this issue (https://stackoverflow.com/questions/45659270/trouble-training-caffe-model-cudnn-status-internal-error). – Ryan Aug 13 '17 at 10:15
- I added the snapshot at iteration 1022, but it just gets ignored and training starts from iteration 0. – TSR Sep 16 '18 at 18:56
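As a footnote to the batch-size exchange above: batch_size lives in the data layer of the train/val net, not in the solver, and reducing it does not change the learnable-parameter shapes that are checked when restoring a snapshot. A minimal sketch of the relevant part of train_val.prototxt (the layer name, LMDB path and batch sizes here are placeholders):
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "/path/to/train_lmdb"
    backend: LMDB
    batch_size: 8   # reduced from 16 to fit in GPU memory
  }
}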