I am training a deep learning model with PyTorch. For unknown reasons, memory keeps accumulating, so the session gets killed before 30 epochs and the model underfits.
Some thoughts here:

- Wondering if it's caused by matplotlib, so I added plt.close('all'); didn't work.
- Added gc.collect(); didn't work.
- Wondering if it's caused by cv2.imwrite(), but don't know how to inspect this. Any suggestions? (A monitoring sketch follows this list.)
- PyTorch issues?
- others...
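One way to narrow this down (including the cv2.imwrite() suspicion) is to log the process memory and the number of live tensors every iteration. Below is a minimal sketch, assuming psutil is installed; log_memory is a hypothetical helper name, not part of my code:

import gc
import os
import psutil
import torch

_process = psutil.Process(os.getpid())

def log_memory(tag=""):
    # Resident set size of this Python process, in MB
    rss_mb = _process.memory_info().rss / 1024 ** 2
    # Count tensors still tracked by the garbage collector
    n_tensors = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n_tensors += 1
        except Exception:
            pass
    print("[mem] %s rss=%.1f MB live_tensors=%d" % (tag, rss_mb, n_tensors))

Calling log_memory("before imwrite") and log_memory("after imwrite") around the saving block (and once per iteration) should show whether the RSS or the tensor count keeps growing, and where.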
model.train()
for epo in range(epoch):
    for i, data in enumerate(trainloader, 0):
        inputs = data
        inputs = Variable(inputs)
        optimizer.zero_grad()

        top = model.upward(inputs + white(inputs))
        outputs = model.downward(top, shortcut = True)
        loss = criterion(inputs, outputs)
        loss.backward()
        optimizer.step()

        # Print generated pictures every 100 iters
        if i % 100 == 0:
            inn = inputs[0].view(128, 128).detach().numpy() * 255
            cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + ".png", inn)
            out = outputs[0].view(128, 128).detach().numpy() * 255
            cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + "_re.png", out)

        # Print loss every 50 iters
        if i % 50 == 0:
            print ('[%d, %5d] loss: %.3f' % (epo, i, loss.item()))

        gc.collect()
        plt.close("all")
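One precaution worth trying (not a confirmed fix): make sure the picture-dump block keeps no references to graph-attached tensors or shared numpy views past the iteration. A sketch of that variant of the same block:

        if i % 100 == 0:
            with torch.no_grad():
                # .copy() breaks the shared-memory link between the numpy
                # array and the tensor; .detach() drops any graph reference
                inn = (inputs[0].view(128, 128).cpu().numpy() * 255).copy()
                out = (outputs[0].detach().view(128, 128).cpu().numpy() * 255).copy()
            cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + ".png", inn)
            cv2.imwrite("/home/tk/Documents/recover/" + str(epo) + "_" + str(i) + "_re.png", out)
            del inn, out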
===================================================================
20181222 Update
Datasets & DataLoader
import json
import torch
from torch.utils.data import Dataset

class MSourceDataSet(Dataset):

    def __init__(self, clean_dir):
        clean_list = []
        # cleanfolder (defined elsewhere) lists the JSON files to load
        for i in cleanfolder:
            with open(clean_dir + '{}'.format(i)) as f:
                clean_list.append(torch.Tensor(json.load(f)))

        cleanblock = torch.cat(clean_list, 0)
        self.spec = cleanblock

    def __len__(self):
        return self.spec.shape[0]

    def __getitem__(self, index):
        spec = self.spec[index]
        return spec
trainset = MSourceDataSet(clean_dir)
trainloader = torch.utils.data.DataLoader(dataset = trainset,
                                          batch_size = 4,
                                          shuffle = True)
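For comparison, here is a sketch of a lazier dataset that keeps only file names and loads one JSON file per __getitem__ call; LazyMSourceDataSet and the one-sample-per-file layout are assumptions, not part of the original code:

import json
import torch
from torch.utils.data import Dataset

class LazyMSourceDataSet(Dataset):
    # Hypothetical variant: keeps only file names in memory and loads
    # one JSON file per item (assumes one sample per file).

    def __init__(self, clean_dir, file_names):
        self.clean_dir = clean_dir
        self.file_names = file_names

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        path = self.clean_dir + self.file_names[index]
        with open(path) as f:
            return torch.Tensor(json.load(f))

If the full concatenated cleanblock fits comfortably in RAM, this change alone would not explain growth across epochs, but it lowers the baseline footprint and makes any leak easier to see in the monitoring output.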
The model is really complicated and long... Also, the memory-accumulation issue did not happen before with the same model, so I will not post it here...