64

I saved a checkpoint while training on a GPU. After reloading the checkpoint and continuing training, I get the following error:

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is as follows:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The saving process happened after finishing each epoch.

torch.save({'epoch': epoch, 'batch': batch_count,
            'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
            'loss': total_loss/len(train_loader), 'train_set': args.train_set, 'val_set': args.val_set, 'args': args},
           f'{args.weights_dir}/FastDepth_Final.pth')

I can't figure out why I get this error. args.gpu == True, and I'm moving the model, all the data, and the loss function to CUDA, yet somehow a tensor is still on the CPU. Could anyone figure out what's wrong?

Thanks.

PaulMest
Ido Do
  • Seems like the issue comes from `criterion(pred, target)`. Can you check `pred.is_cuda` and `target.is_cuda`? – Ivan Feb 07 '21 at 19:49
  • It looks like you are calling `.cuda` on your model too late: this needs to be called BEFORE you initialise the optimiser. From the docs: `If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used`. See the docs [here](https://pytorch.org/docs/stable/optim.html). – UpstatePedro Aug 10 '21 at 14:40

10 Answers

42

There might be an issue with the device the parameters are on. From the PyTorch docs:

If you need to move a model to GPU via .cuda() , please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
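
Applied to the checkpoint case in the question, a minimal sketch of the order of operations might look like the following. It assumes SGD with momentum, as in the traceback. The tiny nn.Linear model, the build helper, and the in-memory buffer standing in for the checkpoint file are hypothetical stand-ins; the checkpoint keys mirror the ones in the question. The point it tries to illustrate is that optimizer.load_state_dict places the restored momentum buffers on whatever device the parameters are on at that moment, so the model should already be on its final device when the optimizer is built and its state is loaded.

```python
import io

import torch
import torch.nn as nn

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def build(device):
    """Hypothetical helper: move the model first, then create the optimizer."""
    model = nn.Linear(4, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer


# Train for one step so the optimizer holds momentum buffers, then save.
model, optimizer = build(device)
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()

checkpoint_file = io.BytesIO()  # stands in for a real .pth file on disk
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()}, checkpoint_file)

# Resume: the model is on its final device BEFORE the optimizer state is loaded.
checkpoint_file.seek(0)
checkpoint = torch.load(checkpoint_file, map_location=device)
model, optimizer = build(device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

# Continue training; parameters and momentum buffers now share a device.
optimizer.zero_grad()
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()
```

In the question's code, model.cuda() is called inside train(), which runs after the optimizer was created and its state restored. Moving that call ahead of the optimizer construction would match the order shown here.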

Shai
  • Adding `.cuda()` to the input data solved it for me: `pred = model(x.cuda())` – Nir Jun 06 '22 at 09:40
21

Make sure to add .to(device) to both the model and the model inputs.
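
A minimal device-agnostic sketch of this advice follows. The nn.Linear model and the random input are hypothetical placeholders. One detail worth noting: calling .to(device) on a module moves its parameters in place, while calling it on a tensor returns a new tensor that has to be assigned back.

```python
import torch
import torch.nn as nn

# Use the GPU when present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)  # moves the module's parameters

x = torch.randn(5, 3)               # created on the CPU by default
x = x.to(device)                    # returns a new tensor, so reassign it

pred = model(x)                     # model and input now share a device
```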

Shirley Ow
6

For me, it worked to add

model.to('cuda')

right after setting my model up:

class Agent:
    def __init__(self):
        self.n_game = 0
        self.epsilon = 0 # Randomness
        self.gamma = 0.9 # discount rate
        self.memory = deque(maxlen=MAX_MEMORY) # popleft()
        self.model = Linear_QNet(11,256,3)                         # here
        self.model.to('cuda')                                      # and here
        self.trainer = QTrainer(self.model,lr=LR,gamma=self.gamma)
4

If, like me, you are still facing this issue, it might be related to the tokenizer. You're moving the model to the GPU but not the tokenized ids!

So, make sure you do this:

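import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# `prompt` is your input string; move the token ids to the model's device.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device) # This line.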

Then you can safely run inference with the model!

Aayush Shah
2

I added the code below at the start of the file, and it solved my issue:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

Aleesha s j
2

Adding the two lines below resolved the issue for me on Colab (add them in both the saving and the loading code):

device = torch.device("cuda")
model.cuda()

Note: if you are using Google Colab, set your Colab runtime to GPU.

2

I'm going through the Fast AI 2022 course and trying to use my M1 Max. I've found that at least with some of the Fastbook code, I could set default_device(torch.device("mps")) and it would resolve my problems.

Here is a reusable snippet that I put at the top of the Jupyter Notebooks I've been dabbling in:

# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    print("MPS is available. Setting as default device.")
    mps_device = torch.device("mps")
    default_device(mps_device)
PaulMest
  • Works for me too, on a MacBook M1, at least for the first chapter. Are there places where it doesn't work for you @paulmest? – Erik B Jan 09 '23 at 09:58
  • @ErikB yes, lots of places. I decided just to pay $10/month to get a Paperspace account. I was spending too much time yak-shaving on getting it to run on the M1. There are about 40 operations that are unsupported in PyTorch + MPS: https://github.com/pytorch/pytorch/issues/77764. So you're bound to hit one of them eventually. – PaulMest Jan 09 '23 at 23:34
  • Thanks! I’ll keep an eye on that list – Erik B Jan 11 '23 at 08:34
0

Shirley Ow's answer helped me: make sure to add .to(device) to both the model and the model inputs.

img = torch.from_numpy(img).to(device) # Code in yolov7
0

I think that after you load the model, it is no longer on the GPU. Try:

model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)
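
One way to check this kind of guess is to list where the tensors actually live before calling optimizer.step(). The sketch below is only an illustration: devices_in_use is a hypothetical helper, not a PyTorch function, and the small model and SGD optimizer are placeholders. If the two sets it reports differ (for example parameters on cuda:0 and optimizer state on cpu), that mismatch is a likely source of the error.

```python
import torch
import torch.nn as nn


def devices_in_use(model, optimizer):
    """Hypothetical helper: collect the devices of parameters and optimizer state."""
    found = {"parameters": set(), "optimizer_state": set()}
    for param in model.parameters():
        found["parameters"].add(str(param.device))
    for state in optimizer.state.values():
        for value in state.values():
            # Optimizer state can hold non-tensor entries; skip those.
            if torch.is_tensor(value):
                found["optimizer_state"].add(str(value.device))
    return found


# Placeholder setup: one training step so the momentum buffers exist.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

report = devices_in_use(model, optimizer)
print(report)
```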

li2
0

This is not the cause in this question, but for anyone else confused by this error: in my case, I hadn't moved the pos_weight argument of BCEWithLogitsLoss to the device. Changing

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]))

to

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]).to(device))

fixed the problem.
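
A self-contained sketch of this fix follows, with made-up logits and targets as placeholders. It also shows an alternative that should be equivalent: because pos_weight is registered as a buffer on the loss module, moving the whole criterion with .to(device) moves the weight as well.

```python
import torch
import torch.nn as nn

# Use the GPU when present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Option 1: create the weight directly on the target device.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0], device=device))

# Option 2: build the loss on the CPU, then move the module; the
# pos_weight buffer moves along with it.
criterion_moved = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0])).to(device)

logits = torch.randn(4, 1, device=device)   # placeholder model outputs
targets = torch.ones(4, 1, device=device)   # placeholder labels

loss = criterion(logits, targets)
loss_moved = criterion_moved(logits, targets)
```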

fmatt