I am trying to train some models with the TensorFlow 2 Object Detection API.
I am submitting the job with this command:
gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 2.1 \
--python-version 3.7 \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-central1 \
--scale-tier CUSTOM \
--master-machine-type n1-highcpu-32 \
--master-accelerator count=4,type=nvidia-tesla-p100 \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The training job is submitted successfully, but when I look at it on AI Platform I see that it is not using the GPUs.
The training logs also show that, in some cases, it could not load the CUDA libraries. The message looks like this:
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
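The library name in that message encodes the CUDA runtime version that the loaded TensorFlow build is looking for (11.0 here), which can then be compared against what the selected --runtime-version provides. A minimal sketch for pulling that out of a log line, assuming the message format is exactly as quoted above; the helper name missing_cuda_library is hypothetical and not part of any library:

```python
import re

# Log line copied from the job logs; the pattern below assumes this exact format.
LOG_LINE = (
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: "
    "libcudart.so.11.0: cannot open shared object file: No such file or directory; "
    "LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:"
    "/usr/local/cuda/lib64:/usr/local/nvidia/lib64"
)


def missing_cuda_library(line):
    """Return (library_name, version) from a 'Could not load dynamic library'
    message, or None if the line does not match the assumed format."""
    match = re.search(
        r"Could not load dynamic library '(lib\w+)\.so\.([\d.]+)'", line
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


result = missing_cuda_library(LOG_LINE)
print(result)
```

If the version reported here differs from the CUDA version shipped with the runtime, the TensorFlow package actually installed in the job is probably not the one the runtime was built for.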
I used AI Platform for training a few months ago and it worked. As far as I can tell, nothing has changed in my own setup, so I don't know what is different now.
For the record, I am training Mask R-CNN now. A few months ago I trained Faster R-CNN and SSD models.