
All,

I'm trying to use Keras to do image classification with two classes. For one class I have a very limited number of images, say 500. For the other class I have an almost unlimited number of images. If I want to use Keras image preprocessing, how should I do that? Ideally I need something like this: for class one, I feed in the 500 images and use ImageDataGenerator to get more images; for class two, I extract 500 images at a time, in sequence, from a dataset of 1,000,000 images, probably with no data augmentation needed. Looking at the example here and at the Keras documentation, I found that the training folder contains an equal number of images for each class by default. So my question is: is there an existing API for doing this? If so, please kindly point it out to me. If not, is there any workaround for this need?

Jane

1 Answer


You have some options.

Option 1

Use the class_weight parameter of the fit() function, which is a dictionary mapping classes to a weight value. Let's say you have 500 samples of class 0 and 1500 samples of class 1; then you feed in class_weight = {0:3 , 1:1}. That gives class 0 three times the weight of class 1.
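As a rough sketch (assuming a compiled binary classifier named `model` and a `train_generator` built with flow_from_directory, neither of which is shown here), passing the dictionary could look like this:

```python
# Rough sketch (Keras 2.x). "model" and "train_generator" are assumed to
# already exist: a compiled classifier and a flow_from_directory generator.
class_weight = {0: 3., 1: 1.}  # class 0 counts three times as much in the loss

model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    epochs=10,
    class_weight=class_weight)
```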

train_generator.classes gives you the proper class labels for your weighting.

If you want to calculate this programmatically, you can use scikit-learn's sklearn.utils.compute_class_weight(): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py

The function looks at the distribution of labels and produces weights to equally penalize under- or over-represented classes in the training set.
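For instance (a sketch only; it assumes a `train_generator` created with flow_from_directory), the weights can be derived directly from the generator's labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Integer label per training sample, as produced by flow_from_directory.
labels = train_generator.classes
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(labels),
                               y=labels)
class_weight = dict(enumerate(weights))  # e.g. {0: 2.0, 1: 0.67} for 500 vs. 1500 samples
```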

See also this useful thread here: https://github.com/fchollet/keras/issues/1875

This thread might also be of help: Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

Option 2

Do a dummy training run with a generator where you apply your image augmentation (rotation, scaling, cropping, flipping, etc.) and save the augmented images for the real training later. That way you can create a bigger or even balanced dataset for your underrepresented class.

In this dummy run you set save_to_dir in the flow_from_directory function to a folder of your choosing, and later on you only take the images from the class that you need more samples of. You obviously discard any training results, since you only use this run to get more data.
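A minimal sketch of that dummy run (the folder names below are placeholders; assume the source directory has a single subfolder holding only the underrepresented class):

```python
from keras.preprocessing.image import ImageDataGenerator

aug = ImageDataGenerator(rotation_range=20,
                         width_shift_range=0.1,
                         height_shift_range=0.1,
                         zoom_range=0.1,
                         horizontal_flip=True)

gen = aug.flow_from_directory(
    'data/small_class_only',       # placeholder: one subfolder with the rare class
    target_size=(224, 224),
    batch_size=32,
    save_to_dir='data/augmented',  # augmented copies are written here
    save_prefix='aug',
    save_format='jpeg')

# Pull a few batches; each call writes its augmented images to save_to_dir.
for _ in range(20):
    next(gen)
```

Afterwards you copy the saved images into the real training folder for that class and train without augmentation.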

petezurich
  • First, thank you very much for your prompt reply. I took a look at the links about class_weight that you provided. I feel that setting the class_weight is very tricky and that there is no standard way to do it, so I have to tune it for my case. Is my understanding right? – Jane Jun 22 '17 at 04:18
  • Second, I wish I could use a small number of images for both classes while training in each epoch. If I use augmentation to get more images for class one, should I do data augmentation again in different epochs, or will the model see the same images multiple times? – Jane Jun 22 '17 at 04:24
  • Reg. your first comment: using `class_weight` is actually really easy (and I'm sorry if my lengthy answer might imply otherwise). You just estimate the percentage that each class accounts for and put these values in the dict. Reg. the second comment: if you want a balanced dataset for training, I suggest doing all the image augmentation in a first (dummy) run and saving all the images to your hard disk. In the second (real) training round you don't augment; otherwise you might double some transformations. – petezurich Jun 22 '17 at 06:54
  • Thank you so much. But I'm wondering whether, in your original answer, it should be class_weight = {0:75 , 1:25} instead of class_weight = {0:25 , 1:75}. – Jane Jun 22 '17 at 13:23
  • You're welcome. And sorry again if I was unclear. Your first class (which has 500 samples in my example and therefore accounts for 25%) becomes class #0 in the dict because we have to begin counting at zero. With `train_generator.classes` you either get the correct index for your dict or, if you have named classes, those class names that you then put into the dict. – petezurich Jun 22 '17 at 13:27
  • And maybe class_weight = {0:3 , 1:1} is better for convergence, as I'm concerned that the values, like 75 and 25, might be used directly by backpropagation algorithms like SGD. Or will the algorithm calculate the percentage rather than using these numbers? Thank you. – Jane Jun 22 '17 at 13:37
  • Yes, I understand that the array index starts from 0. I'm thinking that class 1 (500 images) is the minority class, so we have to set a larger weight for it, based on the code in the [last link you gave](https://stackoverflow.com/questions/42586475/is-it-possible-to-automatically-infer-the-class-weight-from-flow-from-directory). – Jane Jun 22 '17 at 13:46
  • You are absolutely right. Sorry about that. My mistake. I will correct my answer. – petezurich Jun 22 '17 at 14:32
  • Thank you for the two options you provided. I will experiment with the class_weight option, though I'm worried that punishing the minority class with a large weight may cause some regularization. I will be back and post my results later. – Jane Jun 22 '17 at 14:54
  • @petezurich For a more general imbalanced dataset, with let's say 1000 images in class 'cats' and 2000 images in class 'dogs', I have heard that you can pretty well avoid the training problem by making your batch size very large compared to the number of images and the probability of a class being represented in the batch. For example, if I have 1000 cats and 2000 dogs, it would be very safe to make batch_size = 200 images, because it is very probable that even the underrepresented cats will be represented at least a few times in any batch. Is this claim valid or not? – NeStack Aug 07 '19 at 13:52