I am currently using the YOLOv5 code from https://github.com/ultralytics/yolov5, in which each annotation .txt file has one line per object in the form <object-class> <x> <y> <width> <height> (image below). Each image can contain multiple classes (in the annotation file below, the image has <object-class> values of 0 and 27). The goal is to select an equal number of instances of each object class from the entire dataset. The issue is that each selected image comes with its other labels/bounding boxes too. How can I filter an annotation file so that only rows for a certain class are read? I have thousands of annotation files.

Multiple annotations image

image source: COCO json annotation to YOLO txt format
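
One possible way to keep only the rows for chosen classes is sketched below. It is a minimal sketch rather than anything from the YOLOv5 repository: the function names `filter_label_text` and `filter_label_dir` are hypothetical, and it assumes each row is whitespace-separated with the integer class ID as the first field, as in the format described above. It writes filtered copies to a separate directory so the original annotation files stay untouched.

```python
from pathlib import Path


def filter_label_text(text, keep_classes):
    """Return only the annotation rows whose class ID is in keep_classes."""
    kept = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed rows; a box row needs at least
        # <object-class> <x> <y> <width> <height>.
        if len(parts) < 5:
            continue
        if int(parts[0]) in keep_classes:
            kept.append(line.strip())
    return "\n".join(kept)


def filter_label_dir(src_dir, dst_dir, keep_classes):
    """Write filtered copies of every .txt label file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src_file in sorted(Path(src_dir).glob("*.txt")):
        filtered = filter_label_text(src_file.read_text(), keep_classes)
        # Assumption: files left with no rows are not written at all,
        # so the matching images can be dropped from the subset.
        if filtered:
            (dst / src_file.name).write_text(filtered + "\n")
```

For example, `filter_label_dir("labels/train", "labels_filtered/train", {27})` (both paths are placeholders) would produce label files containing only class 27 rows. Note that the other objects are still visible in the images themselves, so they would be treated as unlabeled background during training.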

torek
maximus
  • What's the "why" behind what you're trying to do? You could do this by filtering out examples of the over-represented classes, but in practice I've never seen this lead to improved model performance. – Brad Dwyer Dec 19 '22 at 22:44
  • Hi @BradDwyer! Thanks for your comment! Basically trying to create a balanced dataset to avoid an overly optimistic accuracy since the model will generally pick the class it sees most often. – maximus Dec 20 '22 at 14:59
  • Class imbalance isn't a huge problem with modern models. E.g., the COCO dataset has 100x more people than fire hydrants, and models seem to do OK. It's way more valuable to add more examples of your uncommon classes than to cut down on your common classes. – Brad Dwyer Dec 22 '22 at 17:04
  • That is a good point about COCO. Do you think the model will then be more biased toward the 'people' class when testing? Would you recommend any reference that talks about this idea? @BradDwyer – maximus Dec 22 '22 at 19:46
  • The mean average precision metric measures this already. If the model has a good mAP then it should be OK. I don't have a reference; this is just personal experience. The only time I've seen it _really_ be a problem is if the model decides "guess nothing" is the best way to maximize its score and therefore doesn't learn to identify a certain thing. This usually happens only if something is _severely_ underrepresented (e.g., a specific rare manufacturing defect that only happens once in every 1000 images). – Brad Dwyer Dec 23 '22 at 20:30
  • Ohh thank you @BradDwyer!! Is there any reference you can point me to which also mentions this? – maximus Jan 09 '23 at 15:57
