
I use different hardware to benchmark multiple possibilities. The code runs in a Jupyter notebook.

When I evaluate the different losses, I get highly divergent results.

I also checked the full config with cfg.dump() - it is completely consistent across environments.
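
A concrete way to do this check (a minimal sketch, assuming the cfg object defined below) is to write the resolved config to a file on each machine and diff the files:

# Sketch: dump the resolved config to disk on each machine,
# then compare, e.g. with `diff cfg_azure.yaml cfg_colab.yaml`.
with open("cfg_dump.yaml", "w") as f:
    f.write(cfg.dump())  # cfg.dump() returns the full config as a YAML string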

Detectron2 Parameters:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("dataset_train",)
cfg.DATASETS.TEST = ("dataset_test",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml")  # initialize training from model zoo weights
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 1200    # you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = []         # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512   # default: 512; RetinaNet has no ROI heads, so this setting has no effect here
#cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25  # leftover from the tutorial, not used by RetinaNet (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
cfg.MODEL.RETINANET.NUM_CLASSES = 3
# NOTE: this is the number of classes; a few popular unofficial tutorials incorrectly use num_classes + 1 here.
cfg.OUTPUT_DIR = "/content/drive/MyDrive/Colab_Notebooks/testrun/output"
cfg.TEST.EVAL_PERIOD = 25
cfg.SEED = 5
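
For completeness, here is a minimal sketch of how such a config is typically driven. The post does not show the dataset registration or the training call, so the COCO-format paths and the DefaultTrainer subclass below are assumptions:

import os
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

# Hypothetical registration - the actual annotation format and paths are not shown in the post.
register_coco_instances("dataset_train", {}, "annotations/train.json", "images/train")
register_coco_instances("dataset_test", {}, "annotations/test.json", "images/test")

class Trainer(DefaultTrainer):
    # TEST.EVAL_PERIOD only triggers evaluation if an evaluator is provided.
    @classmethod
    def build_evaluator(cls, cfg, dataset_name):
        return COCOEvaluator(dataset_name, output_dir=cfg.OUTPUT_DIR)

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
trainer.resume_or_load(resume=False)  # start from the model zoo weights set above
trainer.train()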

1. Environment: Azure

Microsoft Azure - Machine Learning
STANDARD_NC6
Torch: 1.9.0+cu111

Results:

[Results Azure]

Training Log: [Log Azure]


2. Environment: Colab

Google Colab (free tier)

Torch: 1.9.0+cu111 

Results:

[Results Google Colab]

Training Log: [Log Colab]


EDIT:

3. Environment: Ubuntu

Ubuntu 22.04
RTX 3080
Torch: 1.9.0+cu111

Results:

[Results Ubuntu]

Training Log: https://pastebin.com/PwXMz4hY
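
Since only the Torch version is listed for each environment, it may also be worth comparing the full library stack. Detectron2 ships a helper for this (a sketch, not part of the original runs):

from detectron2.utils.collect_env import collect_env_info

# Prints PyTorch/CUDA/cuDNN/detectron2 versions, GPU model, compiler, etc.
# Running this in each environment and diffing the output reveals library
# differences that cfg.dump() cannot catch.
print(collect_env_info())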


New dataset

The issue is not reproducible with a larger dataset:

[Comparison with larger dataset]

  • Do you have a third system to test on? Perhaps something CPU-only? That'd let you guess which of the two is wrong/broken -- can you run any of the libraries' tests that check numerics for correctness? – Christoph Rackwitz Jun 30 '22 at 13:20
  • Good point - this was also my thought. I will test it tomorrow on a third system (Ubuntu & local GPU) to have another comparison, and will edit the main post after running the third test. – Natrium2 Jun 30 '22 at 13:21
  • Fascinating. So your Ubuntu environment agrees with Azure, so either they're both "broken" or Google Colab does some kind of magic to improve training... or some libraries differ that you haven't accounted for. – Christoph Rackwitz Jul 04 '22 at 15:10
  • That's the question at this point. I will do the test again with a different dataset (maybe the amount of images had some strange impact) and will edit the results afterwards in this thread. – Natrium2 Jul 05 '22 at 13:18
  • Now that you mention it, that looks like what's going on. I would have assumed that you work with identical data. Validation loss goes up again, evidence of overfitting. I think you are giving it more data on the Colab instance than on the Ubuntu or Azure instances. – Christoph Rackwitz Jul 05 '22 at 13:22
  • Your assumption is right - I used the same dataset for all environments: 20 images train, 10 images test, and no random splitting. Maybe the amount for this test job was just too low - but I think that doesn't explain the different results. I will start a new test tomorrow with 400/200 pre-labeled images and check the results. – Natrium2 Jul 05 '22 at 14:20
  • Seems like the dataset was responsible for this issue - I can't reproduce it with a larger dataset, so everything seems to be fine. Find the comparison image in the thread. – Natrium2 Jul 07 '22 at 06:45
