Undefined symbol error when trying to load Huggingface's T5

Question

Issue

I tried to load a T5 model from the Hugging Face transformers library in Python as follows:

import torch
import transformers
from transformers import AutoModelForSeq2SeqLM

plm = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

The from_pretrained line results in an error:

  File "main.py", line 64, in main
    plm = AutoModelForSeq2SeqLM.from_pretrained(args.checkpoint)
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2351, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1499, in __init__
    self.encoder = T5Stack(encoder_config, self.shared)
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 861, in __init__
    [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 861, in <listcomp>
    [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 646, in __init__
    self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
  File "/home/abr247/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 577, in __init__
    self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
  File "/home/abr247/.local/lib/python3.8/site-packages/apex/normalization/fused_layer_norm.py", line 364, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.8/dist-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN8pybind116detail11type_casterIN3c108ArrayRefIlEEvE4loadENS_6handleEb

I can reproduce the error minimally with import fused_layer_norm_cuda, which yields:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    import fused_layer_norm_cuda
ImportError: /usr/local/lib/python3.8/dist-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN8pybind116detail11type_casterIN3c108ArrayRefIlEEvE4loadENS_6handleEb
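The tracebacks above mix two install locations: apex and transformers load from /home/abr247/.local/lib/python3.8/site-packages, while the failing .so lives in /usr/local/lib/python3.8/dist-packages. A minimal sketch for checking where each module would be loaded from, without importing it (so a broken extension cannot crash the check), might look like this. The helper name locate is hypothetical, and the module list is only a guess at what is worth inspecting.

```python
import importlib.util
from typing import Dict, Iterable, Optional


def locate(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each top-level module name to the file it would load from.

    Nothing is imported, so a broken extension module cannot raise here.
    A value of None means the module was not found (or has no single file,
    as is the case for namespace packages).
    """
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        found[name] = spec.origin if spec is not None else None
    return found


# Module names taken from the traceback; adjust as needed.
for name, origin in locate(["torch", "apex", "fused_layer_norm_cuda"]).items():
    print(f"{name}: {origin or 'not found'}")
```

If torch resolves to the user site (~/.local) while fused_layer_norm_cuda resolves to the system site, the extension may have been compiled against a different torch build than the one being imported, which is one plausible source of an undefined-symbol error.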

Some details

OS: Debian (on a cluster I don't have admin privileges on)
I'm using a Singularity container:
- provided by NVIDIA (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-22-12.html#rel-22-12)
- bootstrapped from a Docker container
- python 3.8
- CUDA 11.8
- pytorch 1.12.1+cu102
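The list above pairs CUDA 11.8 with a PyTorch wheel tagged +cu102. By the usual wheel naming convention, the +cuXYZ suffix encodes the CUDA version the wheel was built for, so the two do not obviously match. A small sketch for decoding the tag follows; the function name cuda_from_torch_version is hypothetical, and the parsing assumes a single-digit minor version, which holds for the tags I know of.

```python
import re
from typing import Optional


def cuda_from_torch_version(version: str) -> Optional[str]:
    """Extract the CUDA version from a wheel version such as '1.12.1+cu102'.

    Assumes the last digit of the tag is the minor version (cu102 -> 10.2).
    Returns None when there is no +cuXYZ suffix (for example a CPU-only build).
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_torch_version("1.12.1+cu102"))  # 10.2
print(cuda_from_torch_version("1.12.1+cu116"))  # 11.6
```

On a live system, the string to pass in would be torch.__version__, and torch.version.cuda reports the same information directly.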

My attempts

I searched for this issue and found a similar error that was not about fused_layer_norm_cuda; the same error reported while using fairseq, where the answers did not help me; and the exact same issue on the NVIDIA/Apex GitHub issue tracker, which received no response. ChatGPT suggested that my Apex installation was incompatible.

I tried installing a PyTorch build compiled for a more recent CUDA and installing an up-to-date Apex, but neither worked. Here are the commands I used:

singularity exec --nv $container pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio -f https://download.pytorch.org/whl/torch_stable.html

singularity exec --nv $container pip install git+https://github.com/NVIDIA/apex.git

Does anyone have any suggestions for what the issue/solution could be?

score 0 · Answer 1 · answered Jul 20 '23 at 11:50

I had a similar problem, and removing the apex package with pip uninstall apex solved it.

More precisely, I had the exact same problem as the one reported with fairseq, but the solution proposed there did not work. When I compared my environment with Colab, where the code ran, I saw that apex was not installed there, so I assumed it was not necessary for my use.
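The traceback in the question shows T5LayerNorm resolving to apex/normalization/fused_layer_norm.py, which suggests transformers only takes that path when apex is importable. A quick way to check whether apex is still visible to the interpreter, before and after uninstalling, could be sketched as follows. The helper name is_installed is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if module_name can be found on the current import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if is_installed("apex"):
    # Use the same interpreter that runs the script, so pip targets
    # the environment that actually contains apex.
    print(f"apex found; to remove it: {sys.executable} -m pip uninstall apex")
else:
    print("apex not found on the import path")
```

Inside a Singularity container, the uninstall command would need the same singularity exec prefix used for the install commands in the question.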

zegrhtryt
