-1

In my CNN for image classification, I get a strange loss and I don't know what's wrong. I would be glad if you could help me find the error. Here is an excerpt of my print output, followed by my code:

Train Epoch: 1 [0/2048 (0%)]    Loss: 0.654869
Train Epoch: 1 [64/2048 (3%)]   Loss: 0.271722
Train Epoch: 1 [128/2048 (6%)]  Loss: 0.001958
Train Epoch: 1 [192/2048 (9%)]  Loss: 0.003399
Train Epoch: 1 [256/2048 (12%)] Loss: 0.000000
Train Epoch: 1 [320/2048 (16%)] Loss: 0.006664
Train Epoch: 1 [384/2048 (19%)] Loss: 0.000000
Train Epoch: 1 [448/2048 (22%)] Loss: 0.000000
Train Epoch: 1 [512/2048 (25%)] Loss: 0.000000
Train Epoch: 1 [576/2048 (28%)] Loss: 0.000000
Train Epoch: 2 [0/2048 (0%)]    Loss: 173505.656250
Train Epoch: 2 [64/2048 (3%)]   Loss: 0.000000
Train Epoch: 2 [128/2048 (6%)]  Loss: 0.000000
Train Epoch: 2 [192/2048 (9%)]  Loss: 33394.285156
Train Epoch: 2 [256/2048 (12%)] Loss: 0.000000
Train Epoch: 2 [320/2048 (16%)] Loss: 0.000000
Train Epoch: 2 [960/2048 (47%)] Loss: 0.000000
Train Epoch: 2 [1024/2048 (50%)]        Loss: 636908.437500
Train Epoch: 2 [1088/2048 (53%)]        Loss: 32862667387437056.000000
Train Epoch: 2 [1152/2048 (56%)]        Loss: 15723443952412777718762887446528.000000
Train Epoch: 2 [1216/2048 (59%)]        Loss: nan
Train Epoch: 2 [1280/2048 (62%)]        Loss: nan
Train Epoch: 2 [1344/2048 (66%)]        Loss: nan
Train Epoch: 2 [1408/2048 (69%)]        Loss: nan

Here is the code for the training:

def trainM(epoch):
    model.train()
    for batch_id, (data, target) in enumerate(net.train_data):
        target = torch.LongTensor(target[64*batch_id:64*(batch_id+1)])
        data = Variable(data)
        target = Variable(target)
        optimizer.zero_grad()

        out = model(data)
        criterion = F.nll_loss

        loss = criterion(out,target)
        loss.backward()
        optimizer.step()
       
        print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(epoch,batch_id*len(data), len(net.train_data)*64, 100*batch_id/len(net.train_data), loss.item()))
        

for item in range(1,10):
    trainM(item)

This is the code for the neural network; at the end is the dataPrep method for data preparation.

train_data = []
target_list = []
class Netz(nn.Module):
    def __init__(self):
        super(Netz, self).__init__()
        self.conv1 = nn.Conv2d(1, 10,kernel_size=5)
        self.conv2 = nn.Conv2d(10,20, kernel_size = 5)
        self.conv_dropout = nn.Dropout2d()
        self.fc1 = nn.Linear(1050,60)
        self.fc2 = nn.Linear(60,2)
        self.fce = nn.Linear(20,1)
    
    def forward(self,x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv_dropout(x)
        x = F.max_pool2d(x,2)
        x = F.relu(x)
        x = x.reshape(x.shape[0], x.shape[1], -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = self.fce(x.permute(0,2,1)).squeeze(-1)
        return F.log_softmax(x, -1)


def dataPrep(list_of_data, data_path, category, quantity):
    global train_data
    global target_list
    train_data_list = []
    
    transform = transforms.Compose([
    transforms.ToTensor(),
        ])
    
    len_data = len(train_data)
    for item in list_of_data:
        f = random.choice(list_of_data)
        list_of_data.remove(f)
        try:
            img = Image.open(data_path +f)
        except:
            continue
        img_crop = img.crop((310,60,425,240))
        img_tensor = transform(img_crop)
        train_data_list.append(img_tensor)

        if category == True:
            target = 1
        else:
            target = 0
        target_list.append(target)
        
        if len(train_data_list) >=64:
            train_data.append((torch.stack(train_data_list), target_list))
            train_data_list = []
            
        if (len_data*64 + quantity) <= len(train_data)*64:
            break
    return list_of_data
Christian01
  • Seems like exploding gradients due to overfitting. Check the test loss to verify. There are several possible remedies, such as decreasing the learning rate, early stopping, adding weight decay, increasing the minibatch size, clipping gradients, using more dropout, enriching the data, etc. This question does not fit well on Stack Overflow. Try https://ai.stackexchange.com/ or https://datascience.stackexchange.com/ instead. – LudvigH Apr 22 '22 at 08:29

2 Answers

1

I might also suggest checking how the weights of the convolutional layers are initialized. PyTorch's `nn.Conv2d` fills its weights with random values by default rather than with 0, but if the weights do end up at 0 (or in a similarly degenerate state), you probably end up predicting a single class for every input. That might explain losses that are either very low (0) or very high, depending on the makeup of the particular batch.

DerekG
  • How can I initialize the weights? There are functions like `torch.nn.init.xavier_uniform_`, but do I set the weights with this function, or is there a variable/method of the Conv2d layer to initialize the weights? – Christian01 Apr 25 '22 at 06:11
  • See this post. The function is not an `nn.Module` class method; it's a separate function in `torch.nn.init`: https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch – DerekG Apr 25 '22 at 12:19
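
To illustrate the question in the comments: the functions in `torch.nn.init` modify a weight tensor in place, so they are usually called on each layer's `weight` attribute, often through `Module.apply`. The following is only a minimal sketch under assumptions: `demo_model` is a small stand-in network rather than the `Netz` class from the question, and `init_weights` is an illustrative helper name.

```python
import torch
import torch.nn as nn


def init_weights(module: nn.Module) -> None:
    """Re-initialize conv and linear layers with Xavier-uniform weights."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        # In-place initialization of the layer's weight tensor.
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Hypothetical stand-in for the question's network (illustration only).
demo_model = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5),
    nn.ReLU(),
    nn.Conv2d(10, 20, kernel_size=5),
)

# Freshly constructed layers already carry non-zero random weights.
default_weights_are_nonzero = bool(demo_model[0].weight.abs().sum() > 0)

# Module.apply calls init_weights on every submodule, modifying it in place.
demo_model.apply(init_weights)
```

With the question's code, the equivalent call would presumably be `model.apply(init_weights)` right after the model is constructed and before training starts.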
0

You can try several approaches to solve this.

  1. Try to reduce the learning rate; try values between 1e-03 and 1e-04.
  2. Clip the gradient; modify your code with something like:
def trainM(epoch):

    ...

    # Backward
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
    optimizer.step()

    ...

  3. Change the data normalization; try both min-max and Z-score normalization.
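
The first two suggestions can be combined into one self-contained training step. This is a sketch under stated assumptions, not the asker's actual setup: the tiny `model`, the random `data` and `target`, and the choice of `torch.optim.SGD` are stand-ins (the question does not show which optimizer is used), and `train_step` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the question's model, optimizer and data.
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 2), nn.LogSoftmax(dim=-1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # reduced learning rate

data = torch.randn(64, 1, 4, 4)
target = torch.randint(0, 2, (64,))


def train_step(data, target, max_norm=1.0):
    """Run one optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss = F.nll_loss(model(data), target)
    loss.backward()
    # Rescale the gradients in place so that their total norm is at most
    # max_norm; the returned value is the norm measured before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item(), float(grad_norm)


loss_value, grad_norm_before_clip = train_step(data, target)
```

Logging `grad_norm_before_clip` next to the loss can help confirm whether the gradients really grow right before the loss turns into `nan`.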

Other than that, I can see that your model reaches convergence very fast (the loss goes to zero pretty soon), and your task might be too easy. You can then reduce the number of iterations.

Deusy94
  • Thanks for your answer. Is there an easy way to apply min-max and Z-score normalization to a tensor? I only find complicated math formulas on the web. – Christian01 Apr 22 '22 at 06:14
  • I've edited the answer and added some references to already answered questions on the topic. You can also check the documentation of Normalize at https://pytorch.org/vision/main/generated/torchvision.transforms.Normalize.html – Deusy94 Apr 22 '22 at 07:34
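
Both normalizations asked about in the comments come down to a few tensor operations. The sketch below assumes the statistics are computed over the whole tensor; the helper names `min_max_normalize` and `z_score_normalize`, the `eps` guard against division by zero, and the random `batch` are illustrative and not taken from the question. `torchvision.transforms.Normalize(mean, std)` applies the same Z-score formula per channel with statistics you supply.

```python
import torch


def min_max_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Rescale all values of x linearly into the range [0, 1]."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + eps)


def z_score_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift and scale x to zero mean and roughly unit standard deviation."""
    return (x - x.mean()) / (x.std() + eps)


torch.manual_seed(0)

# Hypothetical batch shaped like the question's crops: (N, C, H, W) = (8, 1, 180, 115).
batch = torch.rand(8, 1, 180, 115) * 255.0

scaled = min_max_normalize(batch)        # values now lie in [0, 1]
standardized = z_score_normalize(batch)  # mean close to 0, std close to 1
```

Note that `transforms.ToTensor()` already scales 8-bit images into [0, 1], so for the pipeline in the question the Z-score step is probably the more relevant one to try.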