I am training a YOLOv8 model on CUDA using this code:
from ultralytics import YOLO
import torch
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the duplicate OpenMP runtime error
model = YOLO("yolov8n.pt") # load a pretrained model (recommended for training)
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12)
results = model.val()
model.export(format="onnx")
and I am getting NaN for all losses:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/15 1.74G nan nan nan 51 640: 4%
I have tried training the model on the CPU and it worked fine; the problem appeared after I installed CUDA and started training on the GPU.
I suspected there was an error reading the data or something similar, but everything loads fine.
I think it has something to do with memory: when I decreased the image size, training worked fine, but when I increased the batch size at the same reduced image size, the losses became NaN again. So there seems to be a trade-off between image size, batch size, and memory. I am not 100% sure this is right; it is just what I figured out by experiment. If you have a good answer to this problem, please share it.
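For what it's worth, one possibility I have not ruled out (this is an assumption, not something I have confirmed about my setup) is that the NaN comes from half-precision overflow rather than memory, since GPU training commonly runs with mixed precision (fp16) while CPU training stays in fp32. A minimal NumPy sketch of how fp16 overflow turns a finite value into inf and then NaN:

```python
import numpy as np

# fp16 can only represent magnitudes up to 65504; larger values overflow to inf
x = np.float16(70000.0)
print(np.isinf(x))   # True: the value overflowed

# once an inf appears, ordinary loss arithmetic can produce NaN (e.g. inf - inf)
y = x - x
print(np.isnan(y))   # True
```

If that is the cause, passing `amp=False` to `model.train(...)` (a documented Ultralytics train argument) might be worth testing, but I have not verified this fixes it.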