RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

Question

I saved a checkpoint while training on the GPU. After reloading the checkpoint and continuing training, I get the following error:

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is as follows:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The saving process happened after finishing each epoch.

torch.save({'epoch': epoch, 'batch': batch_count,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': total_loss/len(train_loader),
            'train_set': args.train_set, 'val_set': args.val_set, 'args': args},
           f'{args.weights_dir}/FastDepth_Final.pth')

I can't figure out why I get this error. args.gpu == True, and I'm moving the model, all the data, and the loss function to CUDA, yet somehow there is still a tensor on the CPU. Could anyone figure out what's wrong?

Thanks.

Seems like the issue comes from `criterion(pred, target)`. Can you check `pred.is_cuda` and `target.is_cuda`? — Ivan, Feb 07 '21 at 19:49
It looks like you are calling `.cuda` on your model too late: this needs to be called BEFORE you initialise the optimiser. From the docs: `If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used`. See the docs [here](https://pytorch.org/docs/stable/optim.html). — UpstatePedro, Aug 10 '21 at 14:40

score 42 · Answer 1 · answered Feb 08 '21 at 06:14

There might be an issue with the device the parameters are on. From the PyTorch docs:

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
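
For the resume case in the question, one way to apply this is to move the model to the target device first, then construct the optimizer, and only then load the optimizer state, so that the momentum buffers end up on the same device as the parameters. The sketch below is a minimal, self-contained illustration and not the asker's code: the tiny `nn.Linear` model, the in-memory buffer standing in for the `.pth` file, and the helper names `build` and `train_step` are assumptions made for the example.

```python
import io

import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on machines without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def build(device):
    """Create a toy model on `device`, then an optimizer for its parameters."""
    model = nn.Linear(4, 2).to(device)  # move the model first...
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # ...then build the optimizer
    return model, optimizer


def train_step(model, optimizer):
    """Run one optimisation step on random data created on the chosen device."""
    inputs = torch.randn(8, 4, device=device)
    loss = model(inputs).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# "First run": train for one step so SGD creates its momentum buffers, then save.
model, optimizer = build(device)
train_step(model, optimizer)
checkpoint_file = io.BytesIO()  # stands in for the .pth file on disk
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()}, checkpoint_file)

# "Resumed run": model on the device, optimizer constructed, state loaded last.
checkpoint_file.seek(0)
checkpoint = torch.load(checkpoint_file, map_location=device)
resumed_model, resumed_optimizer = build(device)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

resumed_loss = train_step(resumed_model, resumed_optimizer)  # runs without a device mismatch
```

If the optimizer state is instead loaded while the model is still on the CPU and the model is moved afterwards, the momentum buffers can stay on the CPU, which would match the failing `buf.mul_(momentum).add_(d_p, ...)` line in the traceback.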

Shai

adding `.cuda()` to the input data solved it for me: `pred = model(x.cuda())` – Nir Jun 06 '22 at 09:40

score 21 · Answer 2 · answered Sep 22 '22 at 09:54

Make sure to add .to(device) to both the model and the model inputs.
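
A minimal sketch of that pattern, assuming a toy `nn.Linear` model and random inputs (both hypothetical, chosen only to keep the example runnable). It falls back to the CPU when CUDA is not available:

```python
import torch
import torch.nn as nn

# Pick the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)  # moves the module's parameters to the device

inputs = torch.randn(4, 10)          # created on the CPU by default
targets = torch.randn(4, 1)
# Tensor.to() returns a new tensor, so the result must be assigned back.
inputs, targets = inputs.to(device), targets.to(device)

pred = model(inputs)
loss = nn.functional.mse_loss(pred, targets)
```

Note that `Tensor.to(device)` is not in place: calling `inputs.to(device)` without using the return value leaves `inputs` on the CPU.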

Shirley Ow

`model = model.to(device)` – alchemy Sep 26 '22 at 01:34

score 6 · Answer 3 · answered Dec 13 '22 at 16:43

For me, it worked to add

model.to('cuda')

right after setting my model up:

class Agent:
    def __init__(self):
        self.n_game = 0
        self.epsilon = 0  # Randomness
        self.gamma = 0.9  # discount rate
        self.memory = deque(maxlen=MAX_MEMORY)  # popleft()
        self.model = Linear_QNet(11, 256, 3)                        # here
        self.model.to('cuda')                                       # and here
        self.trainer = QTrainer(self.model, lr=LR, gamma=self.gamma)

ricksanchezdev

Aayush Shah · Answer 4 · 2023-03-24T09:18:32.950

If, like me, you are still facing this issue, it might be related to the tokenizer. You're moving the model to the GPU but not the tokenized IDs!

So, make sure you go by this:

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device) # This line.

Then you can safely make the inference from the model!

score 2 · Answer 5 · answered Sep 12 '22 at 17:31

I added the code below at the start of the file, and it solved my issue:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

Aleesha s j

amin jahani · Answer 6 · 2022-11-14T05:14:09.177

Adding the two lines below resolved the issue for me on Colab (add them in both the saving and the loading code):

device = torch.device("cuda")
model.cuda()

Note: if you are using Google Colab, you should set your Colab runtime to GPU.

edited Nov 14 '22 at 05:14

answered Nov 14 '22 at 05:08

score 2 · Answer 7 · answered Nov 21 '22 at 04:11

I'm going through the Fast AI 2022 course and trying to use my M1 Max. I've found that at least with some of the Fastbook code, I could set default_device(torch.device("mps")) and it would resolve my problems.

Here is a reusable snippet that I put at the top of the Jupyter Notebooks I've been dabbling in:

# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    print("MPS is available. Setting as default device.")
    mps_device = torch.device("mps")
    default_device(mps_device)

PaulMest

Works for me too, on a MacBook M1, at least for the first chapter. Are there places where it doesn't work for you @paulmest? – Erik B Jan 09 '23 at 09:58
@ErikB yes, lots of places. I decided just to pay $10/month to get a Paperspace account. I was spending too much time yak-shaving on getting it to run on the M1. There are about 40 operations that are unsupported in PyTorch + MPS: https://github.com/pytorch/pytorch/issues/77764. So you're bound to hit one of them eventually. – PaulMest Jan 09 '23 at 23:34
Thanks! I’ll keep an eye on that list – Erik B Jan 11 '23 at 08:34

score 0 · Answer 8 · answered Nov 01 '22 at 19:03

Shirley Ow's answer helped me: make sure to add .to(device) to both the model and the model inputs.

img = torch.from_numpy(img).to(device) # Code in yolov7

Genius Mouse

score 0 · Answer 9 · answered Jan 31 '23 at 05:54

I think that after you load the model, it is no longer on the GPU. Try:

model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)
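
To check where things actually live after loading, it can help to list the devices of the model's parameters and buffers and of the optimizer's state tensors. The helper below, `devices_in_use`, is a hypothetical name written for this sketch (it is not a PyTorch API), and the toy model and SGD optimizer are assumptions:

```python
import torch
import torch.nn as nn


def devices_in_use(model, optimizer=None):
    """Return the set of device strings used by parameters, buffers and optimizer state."""
    found = {str(p.device) for p in model.parameters()}
    found |= {str(b.device) for b in model.buffers()}
    if optimizer is not None:
        for state in optimizer.state.values():
            for value in state.values():
                if torch.is_tensor(value):
                    found.add(str(value.device))
    return found


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One step so the optimizer has momentum buffers to inspect.
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()

# For SGD with momentum, more than one entry here suggests a device mismatch.
print(devices_in_use(model, optimizer))
```

Some optimizers may deliberately keep small bookkeeping tensors (such as a step counter) on the CPU, so treat the output as a hint about where to look rather than as proof of a bug.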

li2

score 0 · Answer 10 · answered Mar 05 '23 at 21:37

This is not the cause in this question, but for those who get this error and are as confused as I was: I hadn't moved the pos_weight argument of BCEWithLogitsLoss to the device! Changing

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]))

to

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]).to(device))

fixed the problem.
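
A small sketch of the fixed version, with random logits and targets as stand-in data (assumptions for the example). Because `BCEWithLogitsLoss` stores `pos_weight` as a module buffer, moving the whole criterion with `.to(device)` should have the same effect as moving the tensor itself:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Option 1: create pos_weight directly on the device.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0], device=device))

# Option 2: build it on the CPU and move the whole criterion afterwards.
criterion_moved = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0])).to(device)

logits = torch.randn(5, 1, device=device)                     # stand-in model output
targets = torch.randint(0, 2, (5, 1), device=device).float()  # stand-in binary labels

loss = criterion(logits, targets)
loss_moved = criterion_moved(logits, targets)
```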
