
I am trying to train some models with the TensorFlow 2 Object Detection API on Google AI Platform.

I am using this command:

gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 2.1 \
    --python-version 3.7 \
    --job-dir=gs://${MODEL_DIR} \
    --package-path ./object_detection \
    --module-name object_detection.model_main_tf2 \
    --region us-central1 \
    --scale-tier CUSTOM \
    --master-machine-type n1-highcpu-32 \
    --master-accelerator count=4,type=nvidia-tesla-p100 \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

The training job is submitted successfully, but when I look at the submitted job on AI Platform I notice that it is not using the GPUs (screenshot of the job's monitoring page omitted).
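
For reference, the configuration that actually got attached to the job can be inspected with gcloud (a sketch; JOB_ID is a placeholder for the job name printed by the submit command):

gcloud ai-platform jobs describe JOB_ID \
    --format="yaml(trainingInput)"

If the acceleratorConfig shows up there, the request itself was accepted with the GPUs and the problem is somewhere in the runtime environment.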

Also, when looking at the logs of the training job, I noticed that in some cases it could not load the CUDA libraries. It says something like this:

Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64

I used AI Platform for training a few months ago and it worked. I don't know what has changed since then; on my side, nothing has changed.

For the record, I am training Mask R-CNN now. A few months back I trained Faster R-CNN and SSD models.

sniper71

1 Answer


Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64

I'm not sure, as I couldn't test this myself. A quick Google search shows that people have hit this error for many different reasons, so the fix depends on the setup. The same question has already been asked on SO, and you may have missed it; check it first, here.

Also, check the related issue posted below.

If you have tried every possible solution and the issue still remains, update your question with what you have tried.

I think there may be a mismatch between your CUDA setup (CUDA, cuDNN) and your TensorFlow version; check them first in your working environment. Also, make sure the CUDA path is set correctly. Based on the error message, you need to ensure the following is set properly:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64/
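
If you have shell access to the environment where training runs (for example, a local machine or a VM), a quick sanity check could look like this (a sketch; it assumes python and nvidia-smi are on the PATH):

# Check the driver and the GPUs the machine exposes
nvidia-smi

# Check which CUDA runtime versions are installed on disk
ls /usr/local/cuda*/lib64/libcudart.so*

# Check whether TensorFlow itself can see the GPUs
python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"

If libcudart.so.11.0 is missing from the machine entirely, exporting LD_LIBRARY_PATH will not help; the CUDA 11 runtime itself has to be present in the image.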
Innat
  • I think you misunderstood. The TensorFlow Object Detection API has its own configuration on Google AI Platform (I think it's a Docker image built by the Google Cloud team). This means that I don't have access to the code. I am not running the training on my own local machine, nor on a cloud VM. I am using Google AI Platform, where basically all I have to do is run a command in my terminal using gcloud. – sniper71 Mar 28 '21 at 12:48
  • I see, sorry it didn't help you. I just tried to provide some info that might be useful. – Innat Mar 28 '21 at 13:25
  • Have you checked this? https://stackoverflow.com/questions/66550195/could-not-load-dynamic-library-libcuda-so-1-error-on-google-ai-platform-with-cus – Innat Mar 28 '21 at 13:25
  • Thank you for your answer, Innat. Unfortunately this won't work for me either, because the suggested solution is for creating your own custom Docker image that can run the training on a GPU. I know how to do that. But the problem I am facing is related to a Docker image that is built and maintained by the Google Cloud team, so I can't modify it since I don't have access to it. – sniper71 Mar 28 '21 at 19:00