I am training a YOLOv8 model on CUDA using this code:
from ultralytics import YOLO
import torch
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the duplicate OpenMP runtime error
model = YOLO("yolov8n.pt") # load a pretrained model (recommended for training)
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12)
results = model.val()
model.export(format="onnx")
and I am getting NaN for all losses:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/15 1.74G nan nan nan 51 640: 4%
I have tried training the model on the CPU and it worked fine; the problem appeared after I installed CUDA and started training on the GPU.
I suspected there was an error reading the data or something similar, but everything loads fine.
I think it has something to do with memory: when I decreased the image size, training worked fine, but when I increased the batch size at the same reduced image size, the losses became NaN again. So there seems to be a trade-off between image size, batch size, and memory. I am not 100% sure this is right; it is just what I figured out by experiment. If you have a good answer to this problem, please share it.
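For what it's worth, one possibility I have not ruled out (this is an assumption, not something I have confirmed about my setup) is that the NaN comes from half-precision overflow rather than memory, since GPU training commonly runs with mixed precision (fp16) while CPU training stays in fp32. A minimal NumPy sketch of how fp16 overflow turns a finite value into inf and then NaN:

```python
import numpy as np

# fp16 can only represent magnitudes up to 65504; larger values overflow to inf
x = np.float16(70000.0)
print(np.isinf(x))   # True: the value overflowed

# once an inf appears, ordinary loss arithmetic can produce NaN (e.g. inf - inf)
y = x - x
print(np.isnan(y))   # True
```

If that is the cause, passing `amp=False` to `model.train(...)` (a documented Ultralytics train argument) might be worth testing, but I have not verified this fixes it.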