Number of instances per class in pytorch dataset

Question

I'm trying to make a simple image classifier using PyTorch. This is how I load the data into a dataset and DataLoader:

batch_size = 64
validation_split = 0.2
data_dir = PROJECT_PATH+"/categorized_products"
transform = transforms.Compose([transforms.Grayscale(), CustomToTensor()])

dataset = ImageFolder(data_dir, transform=transform)

indices = list(range(len(dataset)))

train_indices = indices[:int(len(indices)*0.8)] 
test_indices = indices[int(len(indices)*0.8):]

train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=train_sampler, num_workers=16)
test_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=test_sampler, num_workers=16)

I want to print out the number of images in each class in training and test data separately, something like this:

In train data:

shoes: 20
shirts: 14

In test data:

shoes: 4
shirts: 3

I tried this:

from collections import Counter
print(dict(Counter(sample_tup[1] for sample_tup in dataset.imgs)))

but I got this error:

AttributeError: 'MyDataset' object has no attribute 'img'

possible solution: https://discuss.pytorch.org/t/finding-number-of-samples-per-class-in-multi-label-classification/28261/2 — Mehrdad Salimi, Jun 11 '20 at 08:07

kHarshit · Accepted Answer · 2020-06-11T12:05:18.747

15

Use .targets to access the dataset's labels, i.e.

print(dict(Counter(dataset.targets)))

It prints something like this (e.g. for the MNIST dataset):

{5: 5421, 0: 5923, 4: 5842, 1: 6742, 9: 5949, 2: 5958, 3: 6131, 6: 5918, 7: 6265, 8: 5851}

You can also use .classes or .class_to_idx to map between class names and label ids:

print(dataset.class_to_idx)
{'0 - zero': 0,
 '1 - one': 1,
 '2 - two': 2,
 '3 - three': 3,
 '4 - four': 4,
 '5 - five': 5,
 '6 - six': 6,
 '7 - seven': 7,
 '8 - eight': 8,
 '9 - nine': 9}
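
To report counts by class name rather than by numeric id, the two attributes can be combined. Below is a minimal sketch; `targets` and `class_to_idx` are made-up stand-ins for `dataset.targets` and `dataset.class_to_idx`, chosen only for illustration. If `.targets` is a tensor rather than a list, converting it with `.tolist()` first is probably needed so that equal labels are counted together.

```python
from collections import Counter

# Hypothetical stand-ins for dataset.targets and dataset.class_to_idx.
targets = [0, 0, 1, 0, 1, 1, 1]
class_to_idx = {"shirts": 0, "shoes": 1}

# Invert the name -> id mapping so each id can be looked up by number.
idx_to_class = {idx: name for name, idx in class_to_idx.items()}

# Count the ids, then relabel each count with its class name.
counts_by_name = {idx_to_class[idx]: n for idx, n in Counter(targets).items()}
print(counts_by_name)
```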

Edit: Method 1

Following the comments: to get the class distribution of the training and test sets separately, you can iterate over each subset as below:

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

# labels in training set
train_classes = [label for _, label in train_dataset]
Counter(train_classes)
Counter({0: 4757,
         1: 5363,
         2: 4782,
         3: 4874,
         4: 4678,
         5: 4321,
         6: 4747,
         7: 5024,
         8: 4684,
         9: 4770})

Edit (2): Method 2

Since you have a large dataset and, as you said, iterating over the whole training set takes considerable time, there is another way:

You can use the .indices attribute of the subset, which refers to the indices in the original dataset that were selected for the subset.

i.e.

train_classes = [dataset.targets[i] for i in train_dataset.indices]
Counter(train_classes)  # if this doesn't work (e.g. labels are tensors): Counter(i.item() for i in train_classes)
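
The same index lookup should also work with the `train_indices` and `test_indices` lists from the question, since a sampler only draws from the indices it was given. The sketch below is one way to get the per-class output the question asks for without loading any images. The helper name `class_counts` and the sample labels are hypothetical and used only for illustration.

```python
from collections import Counter

def class_counts(targets, indices, classes):
    """Count samples per class name among the given dataset indices."""
    counts = Counter(targets[i] for i in indices)
    # Report every class, including ones absent from this subset.
    return {name: counts.get(idx, 0) for idx, name in enumerate(classes)}

# Hypothetical stand-ins for dataset.targets and dataset.classes.
targets = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
classes = ["shirts", "shoes"]

# Same unshuffled 80/20 index split as in the question.
indices = list(range(len(targets)))
split = int(len(indices) * 0.8)
train_indices, test_indices = indices[:split], indices[split:]

train_counts = class_counts(targets, train_indices, classes)
test_counts = class_counts(targets, test_indices, classes)
print("In train data:", train_counts)
print("In test data:", test_counts)
```

With a real ImageFolder, the call would presumably be `class_counts(dataset.targets, train_indices, dataset.classes)`, or with `train_dataset.indices` when the split comes from `random_split`.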

edited Jun 11 '20 at 12:05

answered Jun 11 '20 at 07:59

kHarshit


This seems good, but I want the counts for train and test data separately. Even with `train_loader.dataset.targets`, the dataset refers to the original dataset, which contains both test and train data. – Amin Bashiri Jun 11 '20 at 09:50
You'll have to split original dataset into trainset and testset, then you'll be able to access that (I don't think you can access it from dataloaders) e.g. https://stackoverflow.com/a/51768651/6210807 – kHarshit Jun 11 '20 at 09:53
Then this happens `'Subset' object has no attribute 'targets'` – Amin Bashiri Jun 11 '20 at 10:11
Oh, nevertheless, you can simply iterate over the subset then get the classes, check my edit. – kHarshit Jun 11 '20 at 10:37
Thanks for your help, but my dataset is too big: `[label for _, label in train_dataset]` has been running for 15 minutes or so with no output. This is my dataset: `https://drive.google.com/drive/folders/15Kmax4qq0zpfoT_JQBsDUqq4wriB92qo?usp=sharing`. Isn't there a faster way? – Amin Bashiri Jun 11 '20 at 11:19
I'll check if there is another way. Meanwhile, you can do this: since you can get the class distribution of the complete dataset using `.targets`, run the above loop on the test dataset only (it'll take less time) and subtract its counts from the totals. That gives you the class distribution of the training set. – kHarshit Jun 11 '20 at 11:22
Yes, this makes sense, I'll be happy to hear about a better way from you. – Amin Bashiri Jun 11 '20 at 11:24
@AminBashiri try method 2 (using `.indices`). check edit. – kHarshit Jun 11 '20 at 12:05
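
The subtraction workaround suggested in the comments above can be sketched with `collections.Counter`. The label lists are made-up stand-ins for the full dataset's labels and the test subset's labels.

```python
from collections import Counter

# Hypothetical stand-ins: labels of the full dataset and of the test subset.
all_targets = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
test_targets = [1, 0]

total_counts = Counter(all_targets)
test_counts = Counter(test_targets)

# Copy first so total_counts is left untouched. subtract() keeps zero counts,
# whereas the `-` operator would drop classes whose count falls to zero.
train_counts = total_counts.copy()
train_counts.subtract(test_counts)
print(dict(train_counts))
```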

score 0 · Answer 2 · answered Feb 23 '22 at 17:45


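A simple approach: iterate over your dataset (an ImageFolder in your case) and increment a per-class counter.

import torch

dataset = MyDataset()  # an ImageFolder in your case
num_classes = len(dataset.classes)  # assumes the dataset exposes .classes, as ImageFolder does
labels = torch.zeros(num_classes, dtype=torch.long)

# labels[c] ends up holding the number of samples whose label is c
for _, target in dataset:
    labels[target] += 1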


Prajot Kuvalekar

