RuntimeError: GET was unable to find an engine to execute this computation when I using Trainer.train() from hungingface

Question

RuntimeError                              Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 train_results = trainer.train()
      2 wandb.finish()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538     self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1789         tr_loss_step = self.training_step(model, inputs)
   1790 else:
-> 1791     tr_loss_step = self.training_step(model, inputs)
   1793 if (
   1794     args.logging_nan_inf_filter
   1795     and not is_torch_tpu_available()
   1796     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1797 ):
   1798     # if loss is nan or inf simply add the average of previous logged losses
   1799     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2539, in Trainer.training_step(self, model, inputs)
   2536     return loss_mb.reduce_mean().detach().to(self.args.device)
   2538 with self.compute_loss_context_manager():
-> 2539     loss = self.compute_loss(model, inputs)
   2541 if self.args.n_gpu > 1:
   2542     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2571, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2569 else:
   2570     labels = None
-> 2571 outputs = model(**inputs)
   2572 # Save past state if it exists
   2573 # TODO: this needs to be fixed and made cleaner later.
   2574 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/swinv2/modeling_swinv2.py:1274, in Swinv2ForImageClassification.forward(self, pixel_values, head_mask, labels, output_attentions, output_hidden_states, return_dict)
   1266 r"""
   1267 labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
   1268     Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
   1269     config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
   1270     `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
   1271 """
   1272 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1274 outputs = self.swinv2(
   1275     pixel_values,
   1276     head_mask=head_mask,
   1277     output_attentions=output_attentions,
   1278     output_hidden_states=output_hidden_states,
   1279     return_dict=return_dict,
   1280 )
   1282 pooled_output = outputs[1]
   1284 logits = self.classifier(pooled_output)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/swinv2/modeling_swinv2.py:1076, in Swinv2Model.forward(self, pixel_values, bool_masked_pos, head_mask, output_attentions, output_hidden_states, return_dict)
   1069 # Prepare head mask if needed
   1070 # 1.0 in head_mask indicate we keep the head
   1071 # attention_probs has shape bsz x n_heads x N x N
   1072 # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
   1073 # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
   1074 head_mask = self.get_head_mask(head_mask, len(self.config.depths))
-> 1076 embedding_output, input_dimensions = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
   1078 encoder_outputs = self.encoder(
   1079     embedding_output,
   1080     input_dimensions,
   (...)
   1084     return_dict=return_dict,
   1085 )
   1087 sequence_output = encoder_outputs[0]

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/swinv2/modeling_swinv2.py:295, in Swinv2Embeddings.forward(self, pixel_values, bool_masked_pos)
    292 def forward(
    293     self, pixel_values: Optional[torch.FloatTensor], bool_masked_pos: Optional[torch.BoolTensor] = None
    294 ) -> Tuple[torch.Tensor]:
--> 295     embeddings, output_dimensions = self.patch_embeddings(pixel_values)
    296     embeddings = self.norm(embeddings)
    297     batch_size, seq_len, _ = embeddings.size()

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/swinv2/modeling_swinv2.py:353, in Swinv2PatchEmbeddings.forward(self, pixel_values)
    351 # pad the input to be divisible by self.patch_size, if needed
    352 pixel_values = self.maybe_pad(pixel_values, height, width)
--> 353 embeddings = self.projection(pixel_values)
    354 _, _, height, width = embeddings.shape
    355 output_dimensions = (height, width)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
    462 def forward(self, input: Tensor) -> Tensor:
--> 463     return self._conv_forward(input, self.weight, self.bias)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)

RuntimeError: GET was unable to find an engine to execute this computation

I'm not sure that what happen basically this error NOT show when I run my pipeline but a few day ago is appear. How to fix them

Swintransformer from hunggingface : 'microsoft/swinv2-tiny-patch4-window8-256' from transformers import AutoModelForImageClassification, AutoImageProcessor

!pip install transformers==4.26.0 !pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117 !pip install tensorflow --upgrade

How to fix them to working NOT error.........

I had this problem, turns out all that was needed was a reboot (I didn't reboot after driver installation). — BenedictWilkins, Jun 30 '23 at 21:41

score 5 · Answer 1 · answered Mar 24 '23 at 21:16

5

First of all check your CUDA settings. I had exactly the same error today, with ResNet-18, and the source of the problem was incorrect CUDA version. What is worth to examine: what is your OS? look at the output of nvidia-smi look at the output of nvcc --version and check if CUDA version reported is the one expected I also suggest to put print(os.environ) at the beginnig of your script and examine:

cuda related paths in PATH variable used by the script
value of LD_LIBRARY_PATH used by the script

answered Mar 24 '23 at 21:16

mch

149
1
3

I installed gpu driver following google GCP instruction [here](https://github.com/GoogleCloudPlatform/compute-gpu-installation/tree/main/linux), nvidia-smi returns `NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0` while nvcc returns `Cuda compilation tools, release 10.1, V10.1.243`. – B.Mr.W. Jun 24 '23 at 21:45

score 4 · Answer 2 · edited May 26 '23 at 11:42

4

Compare cuda version returned by nvidia-smi and nvcc --version. They need to match. Then you are good to go.

edited May 26 '23 at 11:42

LuGeNat

179
1
2
13

answered Mar 27 '23 at 14:21

Part

149
1
7

I don't think this is the case. This [popular answer](https://stackoverflow.com/a/53504578/1804173) suggests otherwise, and I've successfully used CUDA with non-matching versions between the two. – bluenote10 Jun 17 '23 at 14:23

score 1 · Answer 3 · answered Apr 05 '23 at 15:52

Updating pytorch to the latest version compatible with the CUDA version displayed in my nvidia-smi solves this issue for me:

type nvidia-smi in terminal
check the version in "CUDA Version"
go to https://pytorch.org/get-started/locally/ to find the command you should type in terminal compatible with your CUDA version - installing with conda is recommended
launch the conda command given to update pytorch

score 1 · Answer 4 · answered Jun 15 '23 at 12:27

1

I had the same error with FIND instead of GET and the reason was just that my GPU was already being used. So it can be misleading.

Run nvidia-smi to make sure it's not the case and run first on an empty GPU that you know the model will fit in.

answered Jun 15 '23 at 12:27

ysig

447
4
18

score 1 · Answer 5 · answered Aug 10 '23 at 07:21

1

In my case, I faced this problem while the gpu does not have enough memory. Reducing the bathsize or training with amp may be helpful to solve this problem.

answered Aug 10 '23 at 07:21

lzx1413

21
1

score 0 · Answer 6 · answered Apr 03 '23 at 19:53

0

got this error after installing tlp, which must have messed my nvidia settings; reverting to 'performance mode' profile solved the issue.

answered Apr 03 '23 at 19:53

Igor Pozdeev

286
2
11

score 0 · Answer 7 · answered Aug 09 '23 at 07:45

Check CUDA version and that should be compatible with torch version otherwise you will get this error.

Here is the link to find suitable CUDA and torch version for installation.
https://pytorch.org/get-started/previous-versions/

In my case, i managed to solve this issue by downgrading pytorch from 2.0.1 to 1.8 and problem is solved.

RuntimeError: GET was unable to find an engine to execute this computation when I using Trainer.train() from hungingface

7 Answers7