TF2 object detection API issue with resuming training from saved checkpoint

Question

I'm facing an issue with the TF2 Object Detection API that seems to have appeared overnight. I'm trying to resume training from a saved checkpoint. As usual, before resuming I changed the path in the config file to point to where the checkpoints are, which has always worked.

Today it throws the error below. For some reason, the checkpoint dir and the model dir can no longer be the same. The big problem is that if I change the model dir, training restarts from zero instead of from the last epoch, so I'm stuck. This only happens in TF2; I also tried TF1 and it works fine.

  File "/usr/local/lib/python3.7/dist-packages/object_detection/utils/variables_helper.py", line 230, in ensure_checkpoint_supported
    (' Please set model_dir to a different path.')))
RuntimeError: Checkpoint dir (/content/drive/MyDrive/Object_detection/training) and model_dir (/content/drive/MyDrive/Object_detection/training) cannot be same. Please set model_dir to a different path.

score 4 · Answer 1 · edited Nov 26 '21 at 06:38

'fine_tune_checkpoint' should point to the checkpoints in the 'pre_trained_model' folder;
'model_dir' instead is the directory where YOU are saving your new checkpoints.

There is no need to manually change the folder. If there are any checkpoints in the 'model_dir', training will resume from the latest one. If there are no checkpoints, training will start from the checkpoint taken from the 'pre_trained_model' folder.
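
The behaviour described above can be sketched in plain Python. This only illustrates the decision and is not the API's actual code: `pick_start_point` is a hypothetical helper, and it assumes the checkpoints in 'model_dir' are saved as `ckpt-<N>.index` files, which is worth verifying in your own folder.

```python
import re
from pathlib import Path


def pick_start_point(model_dir, fine_tune_checkpoint):
    """Hypothetical helper: decide where training would start from.

    Returns ("resume", prefix) when model_dir already holds checkpoints,
    otherwise ("fine_tune", fine_tune_checkpoint).
    Assumes checkpoint files are named ckpt-<N>.index.
    """
    numbered = []
    for path in Path(model_dir).glob("ckpt-*.index"):
        match = re.fullmatch(r"ckpt-(\d+)\.index", path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    if numbered:
        # Compare numerically so that ckpt-10 is newer than ckpt-2.
        _, latest = max(numbered, key=lambda item: item[0])
        return "resume", str(latest.with_suffix(""))

    # Nothing saved yet: fall back to the pre-trained checkpoint.
    return "fine_tune", fine_tune_checkpoint
```

Calling it with your 'model_dir' and the 'fine_tune_checkpoint' value from the config shows which of the two cases applies before you launch training.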

answered Jun 23 '21 at 07:42 by Jotunheim · edited Nov 26 '21 at 06:38 by DV82XL

Hi, thanks for the answer. Up until a few months ago you had to manually set the checkpoint when resuming training. Then it changed to what you describe: it automatically searches for the latest checkpoint, and if it cannot find any it starts from zero. – Luigi Di Carlo Jun 24 '21 at 08:17
To clarify, not specifying or removing "fine_tune_checkpoint" in the config will resume training from the last checkpoint, if there are ckpt files in the model dir. Thanks for your answer. – can.do Jul 14 '21 at 15:26
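
If you want to try the variant from the last comment, removing the "fine_tune_checkpoint" entry can be scripted. This is a minimal sketch that treats the config as plain text; `drop_fine_tune_checkpoint` is a hypothetical helper, and it assumes the entry sits on a single line of its own. It deliberately leaves related entries such as "fine_tune_checkpoint_type" untouched.

```python
import re

# Matches a whole "fine_tune_checkpoint: ..." line, but not
# "fine_tune_checkpoint_type: ..." (the name must be followed by a colon).
_FINE_TUNE_LINE = re.compile(
    r"^[ \t]*fine_tune_checkpoint[ \t]*:.*\n?", re.MULTILINE
)


def drop_fine_tune_checkpoint(config_text):
    """Hypothetical helper: return config_text without the fine_tune_checkpoint line."""
    return _FINE_TUNE_LINE.sub("", config_text)
```

Read the config file into a string, pass it through the function, and write the result to a copy so the original config stays available.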

Armin Ghanbarzadeh · Answer 2 · 2021-06-14T09:36:27.800

I faced the same problem. The error said that model_dir and checkpoint_dir could not be the same; however, if they were different, training would just start from the beginning.

It was due to a recent addition (May 7) of a check at the end of the file "research/object_detection/utils/variables_helper.py":

 if model_dir == checkpoint_path_dir:
    raise RuntimeError(
        ('Checkpoint dir ({}) and model_dir ({}) cannot be same.'.format(
            checkpoint_path_dir, model_dir) +
         (' Please set model_dir to a different path.')))

I managed to fix it by changing it to something like:

 if model_dir == checkpoint_path_dir:
    pass
    # raise RuntimeError(
        # ('Checkpoint dir ({}) and model_dir ({}) cannot be same.'.format(
            # checkpoint_path_dir, model_dir) +
         # (' Please set model_dir to a different path.')))

I made this change after cloning the GitHub repository and before installing the object_detection package.
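
The manual edit can also be applied from a script, which is convenient in a notebook that re-clones the repository on every run. This is a sketch under one assumption: the file still contains the exact line `if model_dir == checkpoint_path_dir:` quoted above. `disable_same_dir_check` is a hypothetical helper, not part of the object_detection package.

```python
from pathlib import Path

# The condition guarding the RuntimeError, as quoted from variables_helper.py.
CHECK_LINE = "if model_dir == checkpoint_path_dir:"


def disable_same_dir_check(helper_path):
    """Hypothetical helper: make the same-directory check a no-op.

    Rewrites the condition to `if False:` so the raise is never reached.
    Returns True if the file was changed, False if the line was not found.
    """
    path = Path(helper_path)
    source = path.read_text()
    if CHECK_LINE not in source:
        return False
    patched = source.replace(
        CHECK_LINE, "if False:  # was: model_dir == checkpoint_path_dir"
    )
    path.write_text(patched)
    return True
```

Run it on "research/object_detection/utils/variables_helper.py" inside the clone before installing the package; a return value of False means the file no longer matches and needs to be checked by hand.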

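I believe you could also check out an older version of the repository, from before the check was added, with something like this in a notebook cell (might need some editing to get it working):

import os
import pathlib

# Clone the tensorflow models repository if it doesn't already exist
if "models" in pathlib.Path.cwd().parts:
  while "models" in pathlib.Path.cwd().parts:
    os.chdir('..')
elif not pathlib.Path('models').exists():
  # Full clone: with --depth 1 the older commit would not be available
  !git clone https://github.com/tensorflow/models
  # Find the last commit on master before May 6, 2021 and check it out
  commit = !git -C models rev-list -n 1 --before="2021-05-06 00:00:00" master
  !git -C models checkout {commit[0]}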
