
I'm trying to implement basic softmax-based voting: I take a couple of pretrained CNNs, softmax their outputs, sum the resulting probability vectors, and take the argmax as the final output.

So I loaded 4 different pretrained CNNs (vgg11, vgg13, vgg16, vgg19) from "chenyaofo/pytorch-cifar-models" that were trained on CIFAR10 -- I didn't train them.

  • When I iterate over the test set with a DataLoader with batch_size=128/256, I get 94% accuracy;

  • When I iterate over the test set with batch_size=1, I get 69% accuracy.

How can that be?

This is the code:

# data.py
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

torch.cuda.empty_cache()

model_names = [
        "cifar10_vgg11_bn",
        "cifar10_vgg13_bn",
        "cifar10_vgg16_bn",
        "cifar10_vgg19_bn",
        # "cifar10_resnet56",
]

batch_size = 2

test_transform = transforms.Compose([
    transforms.ToTensor(),
])

def load_models():
    # download the four pretrained CIFAR10 VGG models from torch.hub
    models = []
    for model_name in model_names:
        model = torch.hub.load("chenyaofo/pytorch-cifar-models", model_name, pretrained=True)
        models.append(model)
    return models

testset = datasets.CIFAR10(root='./data', train=False,
                           download=True, transform=test_transform)
testloader = DataLoader(testset, batch_size=batch_size, shuffle=False)

# EnsembleModule.py
import torch
import torch.nn as nn

class MyEnsemble(nn.Module):

    def __init__(self, modelA, modelB, modelC, modelD):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.modelC = modelC
        self.modelD = modelD
        # self.modelE = modelE

    def forward(self, x):
        out1 = self.modelA(x)
        out2 = self.modelB(x)
        out3 = self.modelC(x)
        out4 = self.modelD(x)
        # out5 = self.modelE(x)

        # print(out1.shape)

        # convert each model's logits into class probabilities
        out1 = torch.softmax(out1, dim=1)
        out2 = torch.softmax(out2, dim=1)
        out3 = torch.softmax(out3, dim=1)
        out4 = torch.softmax(out4, dim=1)

        # soft voting: sum the probability vectors (argmax is taken by the caller)
        out = out1 + out2 + out3 + out4

        return out

# main.py
from EnsembleModule import MyEnsemble
from data import load_models, testloader
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

models = load_models()

model = MyEnsemble(models[0], models[1], models[2], models[3])

model.to(device)

total = 0
correct = 0
with torch.no_grad():
    for images, labels in tqdm(testloader):
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predictions = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predictions == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))


2 Answers


You're forgetting to call model.eval():

# ...

model.to(device)
model.eval() # <<<<<<<<<<<<<

total = 0
correct = 0
with torch.no_grad():
    for images, labels in tqdm(testloader):
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)

# ...

As your models contain BatchNorm layers, batch_size=1 is particularly degrading: in training mode, these layers normalize with the statistics of the current batch rather than the running statistics learned during training, and a single-sample batch gives very noisy estimates.
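To see the effect in isolation, here is a minimal sketch (not from the original answer) comparing a BatchNorm2d layer's output in train and eval mode on the same input:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)            # freshly initialized running statistics
x = torch.randn(2, 3, 4, 4)

bn.train()                        # normalizes with the batch's own statistics
out_train = bn(x)

bn.eval()                         # normalizes with the stored running statistics
out_eval = bn(x)

print(torch.allclose(out_train, out_eval))  # False: the two modes disagree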

The pre-processing should also match the one used for training. As you can see in the repository of the author of the models, you should normalize with the following statistics:

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010))
])
  • well now its 30% with batch size = 1 and model.eval()... – JojoHalastra Aug 19 '21 at 14:38
  • @JojoHalastra and how much did you get with `batch_size=128/256` and `.eval()`? – Berriel Aug 19 '21 at 14:39
  • with 256 batch size and .eval() I get also 30%... this is really weird.. – JojoHalastra Aug 19 '21 at 14:46
  • @JojoHalastra I updated with the statistics used by the author of the model you're using, which is slightly different than the usual ones. – Berriel Aug 19 '21 at 15:10
  • @JojoHalastra BTW, the `std` values of Ivan's answer are incorrect for your model. The correct ones are defined [here](https://github.com/chenyaofo/image-classification-codebase/blob/9eb344d237448f96c2ae50c1dfeab7608be768f7/conf/cifar10.conf#L25-L26) – Berriel Aug 19 '21 at 15:25
  • Agreed, for some reason chenyaofo's stats are different from the ones used by `torchvision`. – Ivan Aug 19 '21 at 15:31

You are using models containing batchnorm layers (indicated by the `_bn` suffix in the model names).

This in turn means the results depend on the statistics of the current batch, and those statistics differ between batch_size=2 and batch_size=128. When evaluating, you should always call the nn.Module.eval function. This makes the layers use their running statistics (those learned during training) instead of the batch's statistics. Read this post for more information.

Do note that calling eval recursively propagates to all child modules, so you only need a single call on your ensemble module directly:

model = MyEnsemble(models[0], models[1], models[2], models[3])
model.eval()

This being done, the batch size should have no effect on the performance of your model.
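As a quick sanity check (not part of the original answer), you can verify that the call propagated to every submodule, including the BatchNorm layers inside the four VGGs:

model.eval()
# every descendant module should now report training == False
assert all(not m.training for m in model.modules())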

When training, you will need to turn training mode back on with nn.Module.train.
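For example, the generic toggle pattern (not specific to this code) looks like:

model.train()   # batch statistics are used and running statistics are updated
# ... training loop ...
model.eval()    # back to running statistics for validation/testing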


You also need to normalize the data with the dataset's statistics, which you can do in the torchvision preprocessing pipeline:

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))
])
  • well now its 30% with batch size = 1 and model.eval()... – JojoHalastra Aug 19 '21 at 14:38
  • Did you normalize your test data in the preprocessing? – Ivan Aug 19 '21 at 14:46
  • it worked! how did you know the normalization's values? – JojoHalastra Aug 19 '21 at 15:17
  • I found those [here](https://github.com/kuangliu/pytorch-cifar/issues/19) and [here](https://stackoverflow.com/questions/66678052/how-to-calculate-the-mean-and-the-std-of-cifar10-data). But you should use the stats used for training the model. @Berriel below linked the repo with the correct values. They're slightly different to the ones used by the models provided by `torchvision`. – Ivan Aug 19 '21 at 15:29
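For reference, here is a minimal sketch (assuming the standard torchvision CIFAR10 training split, pooling all pixels per channel) of how such per-channel statistics can be computed:

import torch
from torchvision import datasets, transforms

trainset = datasets.CIFAR10(root='./data', train=True, download=True,
                            transform=transforms.ToTensor())
# stack the whole training set into one (50000, 3, 32, 32) tensor
data = torch.stack([img for img, _ in trainset])
mean = data.mean(dim=(0, 2, 3))   # per-channel mean
std = data.std(dim=(0, 2, 3))     # per-channel std
print(mean, std)                  # roughly (0.49, 0.48, 0.45) / (0.25, 0.24, 0.26)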