I use different hardware to benchmark multiple possibilities. The code runs in a Jupyter notebook.
When I evaluate the losses in the different environments, I get highly divergent results.
I also checked the full config with cfg.dump(); it is completely consistent across environments.
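A minimal sketch of how the config check can be made mechanical, assuming the text returned by cfg.dump() on each machine has been saved as a string. The helper name diff_config_dumps and the two short excerpts are hypothetical placeholders, not the real dumps:

```python
import difflib


def diff_config_dumps(dump_a, dump_b, name_a="run_a", name_b="run_b"):
    """Return unified-diff lines between two config dumps (empty list if identical)."""
    return list(
        difflib.unified_diff(
            dump_a.splitlines(),
            dump_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Hypothetical excerpts standing in for the cfg.dump() output of each environment.
azure_dump = "SEED: 5\nSOLVER:\n  BASE_LR: 0.00025\n  IMS_PER_BATCH: 2\n"
colab_dump = "SEED: 5\nSOLVER:\n  BASE_LR: 0.00025\n  IMS_PER_BATCH: 2\n"

differences = diff_config_dumps(azure_dump, colab_dump, "azure", "colab")
print("configs identical" if not differences else "\n".join(differences))
```

An empty diff only shows that the configuration matches; it says nothing about library versions, GPU kernels, or data loading order.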
Detectron2 Parameters:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("dataset_train",)
cfg.DATASETS.TEST = ("dataset_test",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025 # 0.00125 pick a good LR
cfg.SOLVER.MAX_ITER = 1200 # 300 iterations seemed good enough for the toy dataset in the tutorial; a practical dataset needs longer training
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512 # leftover from the tutorial; 512 is the default, and RetinaNet does not use ROI_HEADS
#cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25 # only has one class (balloon). (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
cfg.MODEL.RETINANET.NUM_CLASSES = 3
# NOTE: this config means the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.
cfg.OUTPUT_DIR = "/content/drive/MyDrive/Colab_Notebooks/testrun/output"
cfg.TEST.EVAL_PERIOD = 25
cfg.SEED = 5
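Since the config fixes a seed, the following sketch shows what explicit seeding of every random number generator might look like in the notebook. The helper seed_everything is a hypothetical name, not part of Detectron2, and whether it changes anything here is an assumption. Note that PyTorch's own documentation does not guarantee identical results across different platforms or GPU models, even with identical seeds.

```python
import os
import random


def seed_everything(seed):
    """Seed the Python RNG and, when they are installed, NumPy and PyTorch."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects processes started later
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        # Ask cuDNN for deterministic kernels and disable autotuning,
        # which can pick different algorithms from run to run.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(5)
first = [random.random() for _ in range(3)]
seed_everything(5)
second = [random.random() for _ in range(3)]
print(first == second)
```

This makes repeated runs on one machine easier to compare; it is not expected to make the three environments bit-identical.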
1. Environment: Azure
Microsoft Azure - Machine Learning
STANDARD_NC6
Torch: 1.9.0+cu111
Results:
Training Log: Log Azure
2. Environment: Colab
Google Colab (free tier)
Torch: 1.9.0+cu111
Results:
Training Log: Log Colab
EDIT:
3. Environment: Ubuntu
Ubuntu 22.04
RTX 3080
Torch: 1.9.0+cu111
Results:
Training Log: https://pastebin.com/PwXMz4hY
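To put a number on how far the runs diverge, the loss curves can be compared iteration by iteration. The sketch below assumes each run's output directory contains Detectron2's metrics.json with one JSON record per line, holding "iteration" and "total_loss" keys. The helper names and the sample records are made up for illustration and are not taken from the logs above:

```python
import json


def load_losses(lines, key="total_loss"):
    """Map iteration -> loss from JSON-lines records, skipping records without the key."""
    losses = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record and "iteration" in record:
            losses[record["iteration"]] = record[key]
    return losses


def max_abs_gap(losses_a, losses_b):
    """Largest absolute loss difference over the iterations both runs logged."""
    shared = sorted(set(losses_a) & set(losses_b))
    if not shared:
        return None
    return max(abs(losses_a[i] - losses_b[i]) for i in shared)


# Made-up records for illustration only.
run_a = ['{"iteration": 19, "total_loss": 1.50}', '{"iteration": 39, "total_loss": 1.20}']
run_b = ['{"iteration": 19, "total_loss": 1.25}', '{"iteration": 39, "total_loss": 1.00}']

gap = max_abs_gap(load_losses(run_a), load_losses(run_b))
print(gap)
```

With real runs, an open file handle for each metrics.json can be passed to load_losses in place of the sample lists.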
New dataset
Issue is not reproducible with a larger dataset: