
I can't get the Hugging Face Trainer() call to actually do anything on Vertex AI Workbench notebooks.

I'm totally stumped and have no idea how to even begin debugging this.

I made this small notebook: https://github.com/andrewm4894/colabs/blob/master/huggingface_text_classification_quickstart.ipynb

If you set framework=pytorch and run it in Colab, it runs fine.

I wanted to move from Colab to something more persistent, so I tried Vertex AI Workbench notebooks on GCP. I created a user-managed notebook (PyTorch:1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1), and when I run the same example notebook in JupyterLab on that instance, it just hangs on the Trainer() call and does nothing.

The GPU does not appear to be doing anything either (though it might not be supposed to at that point, since I think Trainer() is some kind of pre-training step):

```
(base) jupyter@pytorch-1-11-20220819-104457:~$ nvidia-smi
Fri Aug 19 09:56:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
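Since nvidia-smi only shows the driver side, it may also be worth checking what PyTorch itself reports from inside the notebook kernel. The following is a minimal sketch, not a diagnosis: `describe_torch_env` is a hypothetical helper name, and it only reports what the installed build says about itself (its version, the CUDA version it was built against, and whether it can see a GPU).

```python
import importlib.util


def describe_torch_env():
    """Return a small dict describing the PyTorch/CUDA setup seen by this kernel."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info
    try:
        import torch
    except Exception as exc:  # a broken install can fail at import time
        info["import_error"] = repr(exc)
        return info
    info["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    info["built_for_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Running this on both the Colab runtime and the Vertex AI instance and comparing the output would at least show whether the two environments differ in PyTorch build or GPU visibility.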

I found this thread, which seems to describe a similar problem, so I played with as many Trainer() args as I could, but no luck.

So I'm pretty much blocked here. I refactored the code to use TensorFlow, which does work for me (after I installed TensorFlow on the notebook), but it's much slower for some reason.

Basically this was all working great on Colab (in the real code I'm actually working on), but since moving to Vertex AI notebooks I'm blocked by this strange issue.

Any help or advice is much appreciated. I'm new to Hugging Face and PyTorch too, so I'm not even sure what to try or how to run this in some kind of debug mode.
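One possible reason for the TensorFlow version being slower (an assumption, not something confirmed here) is that a self-installed TensorFlow may not be picking up the GPU. A quick way to check is to list the GPUs TensorFlow can see; `tensorflow_gpu_names` below is a hypothetical helper name used only for this sketch.

```python
import importlib.util


def tensorflow_gpu_names():
    """Return the names of GPUs visible to TensorFlow, or None if it cannot be imported."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    try:
        import tensorflow as tf
    except Exception:  # a broken install can fail at import time
        return None
    # An empty list means TensorFlow is running on CPU only.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print("GPUs visible to TensorFlow:", tensorflow_gpu_names())
```

An empty list would suggest the training is running on the CPU, which would explain a large slowdown.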
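One generic way to see where a Python call is stuck is the standard library's `faulthandler` module, which can dump the stack of every thread if a block of code has not finished after a given timeout. This is a minimal sketch under some assumptions: `dump_stack_if_stuck` is a hypothetical helper name, the traceback is written to a real file because a notebook's stderr may not expose a file descriptor, and `time.sleep` stands in for the call that hangs.

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def dump_stack_if_stuck(timeout_s, log_path="hang_traceback.log"):
    """Write all thread stacks to log_path if the wrapped block runs longer than timeout_s."""
    with open(log_path, "w") as log_file:
        # faulthandler writes directly to the file descriptor from a watchdog thread.
        faulthandler.dump_traceback_later(timeout_s, file=log_file)
        try:
            yield log_path
        finally:
            # Cancel the pending dump before the file is closed.
            faulthandler.cancel_dump_traceback_later()


def demo():
    """Stand-in for a hanging call: sleeps longer than the timeout."""
    with dump_stack_if_stuck(0.2, log_path="hang_traceback.log") as path:
        time.sleep(0.6)  # replace with the call that appears to hang
    with open(path) as log_file:
        return log_file.read()


if __name__ == "__main__":
    print(demo())
```

In the notebook, the `time.sleep` line would be replaced by the cell contents that hang (for example the Trainer() construction), with a timeout of a minute or so; the log file then shows which line each thread was executing when the timeout fired.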

Workaround

I noticed that if I create a new notebook from the NumPy/SciPy/scikit-learn image (4 vCPUs, 15 GB RAM, NVIDIA Tesla T4 x) instead of the official PyTorch one from the dropdown, and install PyTorch myself with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch, it all works.

andrewm4894
  • Does this [document](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-train-and-tune-pytorch-models-vertex-ai) help you? – Prajna Rai T Aug 20 '22 at 08:45
  • Oh thanks, I don't see anything majorly different but I'll try recreate that lab on my notebook to see if that does or does not work for me as either way that will give some more info. Cheers. – andrewm4894 Aug 20 '22 at 10:11
  • @PrajnaRaiT I just tried that notebook and got the same behaviour: the `Trainer()` step seems to do nothing on a fresh Vertex notebook instance, but the same notebook runs fine on Colab. I feel like this could be a bug I need to open with the Vertex team. https://colab.research.google.com/drive/171GmwE0QrNk9DWmxck3MuOeoOa167c7S?usp=sharing – andrewm4894 Aug 22 '22 at 10:23
  • I have created this issue, as I think it's a potential bug somewhere in Vertex Workbench and wanted to raise it with someone. https://issuetracker.google.com/issues/243267023 – andrewm4894 Aug 22 '22 at 10:31
  • I noticed that if I create a new notebook with `NumPy/SciPy/scikit-learn 4 vCPUs, 15 GB RAM , NVIDIA Tesla T4 x` and install PyTorch myself with `conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch`, it all works. – andrewm4894 Aug 23 '22 at 16:12

0 Answers