
I made my first Korean chatbot program with Python, PyTorch, and PyCharm. It works in my local environment but is very slow, so I want to move my code to Google Colab to speed it up. On Colab, however, I get a runtime error saying that all tensors should be on the same device, but two devices (cuda and cpu) were found. I looked this error up and learned that everything has to be moved to the GPU, so I added .to(device) / .cuda() calls in several places, but it still doesn't work. Please help me. Below is my whole training code, Trainer.py; the problem occurs when I call it from another file (import trainer).

import aboutDataSets
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm  # training progress bar (1)
from time import sleep  # training progress bar (2)
import re  # regular expressions
import os
import urllib.request  # download the csv file from a URL
from torch.utils.data import DataLoader, Dataset
from transformers.optimization import AdamW  # optimizer
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel

Q_TKN = "<usr>"
A_TKN = "<sys>"
BOS = '</s>'
EOS = '</s>'
MASK = '<unused0>'
SENT = '<unused1>'
PAD = '<pad>'

tokenizer = PreTrainedTokenizerFast.from_pretrained("skt/kogpt2-base-v2",
                                                    bos_token=BOS,
                                                    eos_token=BOS,
                                                    unk_token='unk',
                                                    pad_token=PAD,
                                                    mask_token=MASK)
model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv",
    filename="ChatBotDataMain.csv",
)

ChatData = pd.read_csv("ChatBotDataMain.csv")
ChatData = ChatData[:300]
# print(ChatData.head())

# build the dataset
dataset = aboutDataSets.ChatDataset(ChatData)

batch_size = 32
num_workers = 0

def collate_batch(batch):
    data = [item[0] for item in batch]
    mask = [item[1] for item in batch]
    label = [item[2] for item in batch]
    return torch.LongTensor(data), torch.LongTensor(mask), torch.LongTensor(label)
# needed again below as the collate_fn argument of the DataLoader.

# declare the dataloader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_set = aboutDataSets.ChatDataset(ChatData, max_len=40)
train_dataLoader = DataLoader(train_set,
                              batch_size=batch_size,
                              num_workers=num_workers,
                              shuffle=True,
                              collate_fn=collate_batch,) 

model.to(device) 
model.train()
lr = 3e-5
criterion = torch.nn.CrossEntropyLoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
epoch = 10
sneg = -1e18  


# start training

print("::start::")
for epoch in tqdm(range(epoch)):  # tqdm for progress visualization
    for batch_idx, samples in enumerate(train_dataLoader):
        #print(batch_idx, samples)
        optimizer.zero_grad()
        token_ids, mask, label = samples
        out = model(token_ids)
        out = out.logits  # raw LM-head logits from the model output
        mask_3d = mask.unsqueeze(dim=2).repeat_interleave(repeats=out.shape[2], dim=2)
        mask_out = torch.where(mask_3d == 1, out, sneg * torch.ones_like(out))
        loss = criterion(mask_out.transpose(2, 1), label)
        avg_loss = loss.sum() / mask.sum()  # normalize the loss over the unmasked tokens
        avg_loss.backward()
        # end of training step
        optimizer.step()
print("end")



1 Answer


Replace token_ids, mask, label = samples with token_ids, mask, label = [t.to(device) for t in samples]

This is because the batches produced by the DataLoader are on the CPU by default, not on CUDA. You have to move them to the same device as the model before the forward pass.
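For reference, here is a minimal sketch of how the inner loop in Trainer.py would look with that single change applied (model, criterion, optimizer, device, sneg and train_dataLoader are assumed to stay exactly as in the question):

for epoch in tqdm(range(epoch)):
    for batch_idx, samples in enumerate(train_dataLoader):
        optimizer.zero_grad()
        # move every tensor of the batch onto the same device as the model
        token_ids, mask, label = [t.to(device) for t in samples]
        out = model(token_ids).logits
        mask_3d = mask.unsqueeze(dim=2).repeat_interleave(repeats=out.shape[2], dim=2)
        mask_out = torch.where(mask_3d == 1, out, sneg * torch.ones_like(out))
        loss = criterion(mask_out.transpose(2, 1), label)
        avg_loss = loss.sum() / mask.sum()
        avg_loss.backward()
        optimizer.step()

Equivalently, you can call token_ids = token_ids.to(device) (and the same for mask and label) one by one; the list comprehension just does that in a single line.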

  • Similar issue: https://stackoverflow.com/questions/66091226/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least/74514148#74514148 – Rancho Xia Nov 21 '22 at 05:39