
I am training a deep learning model using PyTorch. For unknown reasons, memory keeps accumulating, which gets the session killed before 30 epochs and leaves the model underfitted.

Some thoughts here:

  1. Wondering if it's caused by matplotlib, so I added plt.close('all'); it didn't work

  2. Added gc.collect(); didn't work

  3. Wondering if it's caused by cv2.imwrite(), but I don't know how to inspect this. Any suggestions? (A rough memory-logging sketch follows this list.)

  4. PyTorch issues?

  5. others...
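
To stop guessing where the growth comes from, here is a minimal sketch of per-iteration memory logging (assuming psutil is installed; the log_mem helper name is mine, not part of the original code):

    import os
    import psutil

    _proc = psutil.Process(os.getpid())

    def log_mem(tag):
        # Print the resident memory of the current process in MB
        rss_mb = _proc.memory_info().rss / 1e6
        print('[mem] %s: %.1f MB' % (tag, rss_mb))

Calling log_mem(...) right before and after the cv2.imwrite block, and once at the end of each epoch in the training loop below, would show which step the memory growth follows.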

    model.train()
    for epo in range(epoch):
        for i, data in enumerate(trainloader, 0):
            inputs = data
            inputs = Variable(inputs)
            optimizer.zero_grad()
    
            top = model.upward(inputs + white(inputs))
            outputs = model.downward(top, shortcut = True)
    
    
            loss = criterion(inputs, outputs)
            loss.backward()
            optimizer.step()
    
            # Print generated pictures every 100 iters
            if i % 100 == 0:
                inn = inputs[0].view(128, 128).detach().numpy() * 255
                cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + ".png", inn)
    
                out = outputs[0].view(128, 128).detach().numpy() * 255
                cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + "_re.png", out)
    
            # Print loss every 50 iters
            if i % 50 == 0:
                print ('[%d, %5d] loss: %.3f' % (epo, i, loss.item()))
    
        gc.collect()
        plt.close("all")
    

===================================================================

20181222 Update

Datasets & DataLoader

    class MSourceDataSet(Dataset):

        def __init__(self, clean_dir):
            # NOTE: cleanfolder (the list of JSON file names inside clean_dir) is defined elsewhere
            clean_list = []
            for i in cleanfolder:
                with open(clean_dir + '{}'.format(i)) as f:
                    clean_list.append(torch.Tensor(json.load(f)))

            # concatenate every file into one tensor that is held in memory
            cleanblock = torch.cat(clean_list, 0)
            self.spec = cleanblock

        def __len__(self):
            return self.spec.shape[0]

        def __getitem__(self, index):
            spec = self.spec[index]
            return spec

    trainset = MSourceDataSet(clean_dir)
    trainloader = torch.utils.data.DataLoader(dataset = trainset,
                                               batch_size = 4,
                                               shuffle = True)
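
As a quick sanity check on the dataset itself (just an idea, not part of the original setup), the size of the concatenated tensor can be printed right after building trainset, to rule out data loading as the main memory consumer:

    # Rough check of how much RAM the concatenated dataset tensor occupies
    size_bytes = trainset.spec.element_size() * trainset.spec.nelement()
    print('dataset tensor: %.1f MB, shape %s' % (size_bytes / 1e6, tuple(trainset.spec.shape)))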

The model is really complicated and long... plus the memory accumulation issue didn't happen before (using the same model), so I will not post it here...

  • Something hidden in trainloader, maybe? – Daniel Möller Dec 21 '18 at 11:13
  • Could you please post your model code? Do you maybe store batch-information in the model? – cleros Dec 21 '18 at 15:08
  • @cleros The model is really complicated and long...plus the memory accumulation issue didn't happen before (using the same model), so I will not post it here... But could you possibly give some examples about batch-information storing you mentioned? Thanks a lot. – sealpuppy Dec 22 '18 at 12:49
  • Why don't you try tracking memory leaks to avoid guessing (see e.g. [here](https://stackoverflow.com/questions/1435415/python-memory-leaks))? – Mikhail Berlinkov Dec 22 '18 at 16:26
