Python: Unable to Train TensorFlow Model Container in SageMaker

Question

I'm fairly new to SageMaker and Docker. I am trying to train my own custom object detection algorithm in SageMaker using an ECS container. I'm using this repo's files:

https://github.com/svpino/tensorflow-object-detection-sagemaker

I've followed the instructions exactly, and I'm able to run the image in a container perfectly fine on my local machine. But when I push the image to ECS to run in SageMaker, I get the following message in CloudWatch:

I understand that, for some reason, the image can't find python once it is deployed to ECS. The first line of my training script is #!/usr/bin/env python. I've tried running the which python command and changing the shebang to #!/usr/local/bin python, but I just get additional errors. I don't understand why this image works on my local machine (tested with both Docker on Windows and Docker CE for WSL) but not in SageMaker. Here's a snippet of the Dockerfile:

ARG ARCHITECTURE=1.15.0-gpu
FROM tensorflow/tensorflow:${ARCHITECTURE}-py3

RUN apt-get update && apt-get install -y --no-install-recommends \
        wget zip unzip git ca-certificates curl nginx python

# We need to install Protocol Buffers (Protobuf). Protobuf is Google's language and platform-neutral,  
# extensible mechanism for serializing structured data. To make sure you are using the most updated code,
# replace the linked release below with the latest version available on the Git repository.
RUN curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.10.1/protoc-3.10.1-linux-x86_64.zip
RUN unzip protoc-3.10.1-linux-x86_64.zip -d protoc3
RUN mv protoc3/bin/* /usr/local/bin/
RUN mv protoc3/include/* /usr/local/include/

# Let's add the folder that we are going to be using to install all of our machine learning-related code
# to the PATH. This is the folder used by SageMaker to find and run our code.
ENV PATH="/opt/ml/code:${PATH}"
RUN mkdir -p /opt/ml/code
WORKDIR /opt/ml/code

RUN pip install --upgrade pip
RUN pip install cython
RUN pip install contextlib2
RUN pip install pillow
RUN pip install lxml
RUN pip install matplotlib
RUN pip install flask
RUN pip install gevent
RUN pip install gunicorn
RUN pip install pycocotools

# Let's now download Tensorflow from the official Git repository and install Tensorflow Slim from
# its folder.
RUN git clone https://github.com/tensorflow/models/ tensorflow-models
RUN pip install -e tensorflow-models/research/slim

# We can now install the Object Detection API, also part of the Tensorflow repository. We are going to change
# the working directory for a minute so we can do this easily.
WORKDIR /opt/ml/code/tensorflow-models/research
RUN protoc object_detection/protos/*.proto --python_out=.
RUN python setup.py build
RUN python setup.py install

# If you are interested in using COCO evaluation metrics, you can run the following commands to add the
# necessary resources to your Tensorflow installation.
RUN git clone https://github.com/cocodataset/cocoapi.git
WORKDIR /opt/ml/code/tensorflow-models/research/cocoapi/PythonAPI
RUN make 
RUN cp -r pycocotools /opt/ml/code/tensorflow-models/research/

# Let's put the working directory back to where it needs to be, copy all of our code, and update the PYTHONPATH
# to include the newly installed Tensorflow libraries.
WORKDIR /opt/ml/code
COPY /code /opt/ml/code

ENV PYTHONPATH=${PYTHONPATH}:tensorflow-models/research:tensorflow-models/research/slim:tensorflow-models/research/object_detection

RUN chmod +x /opt/ml/code/train
CMD ["/bin/bash","-c","chmod +x /opt/ml/code/train && /opt/ml/code/train"]
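
Because the image was built from files on a Windows machine, one thing worth ruling out is that the train script was saved with CRLF line endings. This is only an assumption; nothing in the question confirms it. If the shebang line ends in a carriage return, env is asked to run a program literally named "python\r", which does not exist. The sketch below is a minimal check of the first line of a script; inspect_shebang is a hypothetical helper name used only for illustration.

```python
def inspect_shebang(script: bytes) -> dict:
    """Describe the first line of a script as the kernel would read it."""
    # Split on LF only, so a CR left over from a CRLF ending stays visible.
    first_line = script.split(b"\n", 1)[0]
    has_shebang = first_line.startswith(b"#!")
    # A trailing CR means the file uses Windows (CRLF) line endings.
    has_carriage_return = first_line.endswith(b"\r")
    command = []
    if has_shebang:
        # str.split() with no arguments also drops the trailing CR, so
        # `command` shows what the author intended to run.
        command = first_line[2:].decode("utf-8", errors="replace").split()
    return {
        "has_shebang": has_shebang,
        "has_carriage_return": has_carriage_return,
        "command": command,
    }
```

It could be run against the script before building the image, for example inspect_shebang(open("code/train", "rb").read()), where code/train is assumed to be the location implied by the COPY line above. If has_carriage_return comes back True, converting the file to LF endings before the build would be the next thing to try.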

The FROM is at the top; it uses an official TensorFlow Docker image as the base. In this case it would be tensorflow/tensorflow:1.15.0-gpu-py3 — Ameer Akashe, Mar 16 '20 at 15:01
https://stackoverflow.com/questions/3655306/ubuntu-usr-bin-env-python-no-such-file-or-directory may help you — Brown Bear, Mar 16 '20 at 15:13
I tried this already with no luck. I also tried building and deploying from Ubuntu. — Ameer Akashe, Mar 16 '20 at 15:51
Can you determine which step is failing? It looks like, for some reason, `python` is not on PATH. — 9000, Mar 16 '20 at 17:41
It's the first step that happens. SageMaker automatically runs whatever script is named train, but train is an executable, so it has to have the shebang at the top. — Ameer Akashe, Mar 16 '20 at 19:15
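
The lookup described in the last comment can be approximated locally. SageMaker starts a training container with train as the argument, so that name has to resolve through PATH to a file with the execute bit set. The following is a rough sketch of that lookup, not SageMaker's actual implementation; find_entrypoint and demo are hypothetical names, and a temporary directory stands in for /opt/ml/code.

```python
import os
import shutil
import tempfile


def find_entrypoint(name, search_path):
    """Return the file that a PATH lookup of `name` resolves to, or None.

    shutil.which only returns files that exist and are executable.
    """
    return shutil.which(name, path=search_path)


def demo():
    """Show that the execute bit decides whether `train` is found."""
    with tempfile.TemporaryDirectory() as code_dir:  # stand-in for /opt/ml/code
        script = os.path.join(code_dir, "train")
        with open(script, "w", newline="\n") as handle:
            handle.write("#!/usr/bin/env python\nprint('training')\n")

        os.chmod(script, 0o644)  # readable but not executable
        before = find_entrypoint("train", code_dir)

        os.chmod(script, 0o755)  # equivalent of `chmod +x train`
        after = find_entrypoint("train", code_dir)
    return before, after
```

On a Linux host, demo() should return None for the first lookup and the path to train for the second, which is why the Dockerfile both prepends /opt/ml/code to PATH and runs chmod +x on the script. Note that this lookup only finds the file; the interpreter named in the shebang is resolved separately when the script is executed.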
