Creating balanced dataset for YOLO v5 with each image having multiple annotations

Question

I am currently using the YOLO v5 code from https://github.com/ultralytics/yolov5, in which each annotation txt file has one line per object in the form <object-class> <x> <y> <width> <height> (image below). Each image can contain multiple classes (in the annotation file below, the image has <object-class> values of 0 and 27). The goal is to select an equal number of objects of each class from the entire dataset. The issue is that each selected image comes with its other labels/bounding boxes too. How can I filter the annotation files so that only certain rows are read? I have thousands of annotation files.

image source: COCO json annotation to YOLO txt format
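
One possible way to approach the row filtering asked about above is sketched here. It is only a minimal sketch, not a method from the YOLO v5 repository: the function names `filter_rows` and `filter_label_dir`, the `cap` parameter, and the directory layout are all hypothetical, and it assumes every row has exactly five whitespace-separated fields with an integer class id first. Note that removing a row while keeping the image leaves that object visible but unlabeled, so the model will likely treat it as background during training.

```python
from collections import Counter
from pathlib import Path


def filter_rows(label_text, keep_classes, counts=None, cap=None):
    """Return the YOLO rows whose class id is in keep_classes.

    counts (optional Counter) tracks how many rows of each class have
    been kept so far; if cap is also given, a row is skipped once its
    class has already been kept cap times.
    """
    kept = []
    for row in label_text.splitlines():
        fields = row.split()
        if len(fields) != 5:
            continue  # skip blank or malformed rows (assumed format)
        cls = int(fields[0])
        if cls not in keep_classes:
            continue
        if counts is not None:
            if cap is not None and counts[cls] >= cap:
                continue
            counts[cls] += 1
        kept.append(row.strip())
    return kept


def filter_label_dir(src_dir, dst_dir, keep_classes, cap=None):
    """Write filtered copies of every .txt in src_dir into dst_dir.

    The originals are left untouched. Files with no kept rows are not
    written. Returns the per-class count of kept rows.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    counts = Counter()
    for path in sorted(src.glob("*.txt")):
        rows = filter_rows(path.read_text(), keep_classes, counts, cap)
        if rows:
            (dst / path.name).write_text("\n".join(rows) + "\n")
    return counts
```

For example, `filter_label_dir("labels/train", "labels_balanced/train", {0, 27}, cap=500)` (paths hypothetical) would keep at most 500 rows each of classes 0 and 27, taken in sorted file order. Random sampling instead of a first-come cap would need a shuffle of the file list first.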

What's the "why" behind what you're trying to do? You could do this by filtering out examples of the over-represented classes, but in practice I've never seen this lead to improved model performance. — Brad Dwyer, Dec 19 '22 at 22:44
Hi @BradDwyer! Thanks for your comment! Basically trying to create a balanced dataset to avoid an overly optimistic accuracy since the model will generally pick the class it sees most often. — maximus, Dec 20 '22 at 14:59
Class imbalance isn't a huge problem with modern models. Eg the COCO dataset has 100x more people than fire hydrants & models seem to do Ok. It's way more valuable to add more examples of your uncommon classes than to cut down on your common classes. — Brad Dwyer, Dec 22 '22 at 17:04
That is a good point about COCO. Do you think the model will then be more biased towards the 'people' class when testing? Would you recommend any reference that discusses this idea? @BradDwyer — maximus, Dec 22 '22 at 19:46
The mean average precision metric measures this already. If the model has a good mAP then it should be Ok. I don't have a reference; this is just personal experience. The only time I've seen it _really_ be a problem is if the model decides "guess nothing" is the best way to maximize its score and therefore doesn't learn to identify a certain thing. This usually only happens if something is _severely_ underrepresented (eg a specific rare manufacturing defect that only happens once in every 1000 images). — Brad Dwyer, Dec 23 '22 at 20:30
Ohh thank you @BradDwyer!! Is there any reference you can point me to which also mentions this? — maximus, Jan 09 '23 at 15:57
