
I have a semantic segmentation model deployed on an ml.m4.xlarge instance. I am calling invoke_endpoint from inside an AWS Lambda function using the following bit of code:

with open('/tmp/image.jpg', 'rb') as imfile:  # /tmp is Lambda's writable scratch space
    imbytes = imfile.read()

response = runtime.invoke_endpoint(EndpointName='xyx', ContentType='image/jpeg',
                                   Body=imbytes)

This is when I get the error mentioned above:

Your invocation timed out while waiting for a response from container primary

Does it mean my data point is reaching the model endpoint but inference is taking too long, or is my data not even transferring over to the endpoint?

  • Please add at least 5-10 lines of this endpoint's logs. At first look it appears that there is no worker listening to the requests, i.e. the container started up successfully but the server failed to start its worker threads. Can you try starting the container locally and see if it starts up? – Rahul Nimbal Nov 25 '21 at 07:12

1 Answer


I found a post on Stack Overflow about this: it is a known limitation of AWS SageMaker, whose real-time endpoints do not allow a request to run longer than a certain period of time (60 seconds for InvokeEndpoint).

An alternative solution is to batch up the processing, or to implement it using Asynchronous Inference. References:

  1. Invocation timeouts aws sagemaker
  2. Sagemaker issue #1119
  3. Amazon Sagemaker-asynchronous new inference
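The asynchronous route mentioned above can be sketched roughly as follows. This is not the poster's code: the endpoint, bucket, and key names are placeholders, and it assumes an endpoint has already been created with an AsyncInferenceConfig. Unlike invoke_endpoint, invoke_endpoint_async takes an S3 URI instead of the raw bytes, so the payload is uploaded first:

```python
def input_uri(bucket, key):
    # The s3:// URI that invoke_endpoint_async expects as InputLocation.
    return f's3://{bucket}/{key}'

def invoke_async(endpoint_name, bucket, key, local_path, region='us-east-1'):
    """Upload the image to S3 and queue an async inference request.

    Returns the S3 URI where SageMaker will write the prediction.
    All names here are placeholders.
    """
    import boto3  # imported lazily so the sketch is readable without AWS set up

    boto3.client('s3', region_name=region).upload_file(local_path, bucket, key)
    runtime = boto3.client('sagemaker-runtime', region_name=region)
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_uri(bucket, key),
        ContentType='image/jpeg',
    )
    return response['OutputLocation']
```

Because the request is queued rather than answered inline, async endpoints accept much larger payloads and longer processing times than the 60-second real-time limit.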
  • Thanks. I saw these, but I want to know what these errors mean. Is my data reaching the endpoint or not? – Yash Jain Nov 19 '21 at 20:28
  • Have you tried looking into your endpoint logs? Do they show anything? – Ruben Nov 20 '21 at 04:25
  • Yes, I see things, but mostly gibberish. It says the model loaded, and after that whatever is there is difficult to understand. For example: #metrics { "StartTime": 1637184409.9257314, "EndTime": 1637184913.6898687, "Dimensions": { "Algorithm": "SemanticSegmentationModel", "Host": "UNKNOWN", "Operation": "scoring" }, "Metrics": { "invocations_error.count": { "sum": 1, "count": 1, "min": 1, "max": 1 } } } – Yash Jain Nov 24 '21 at 18:02
  • The metrics show that it ran for ~8 minutes ("StartTime": 1637184409.9257314, "EndTime": 1637184913.6898687), i.e. the endpoint was hit by the Lambda, wasn't it? – Ruben Nov 26 '21 at 05:35
  • Yeah, looks like it did. But the error I am getting is something related to MXNet, which I find hard to diagnose. I also get a stack trace (10 entries). Here is the MXNet error from the logs: mxnet.base.MXNetError: [10:26:14] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x.4276.0/AL2_x86_64/generic-flavor/src/3rdparty/dmlc-core/src/recordio.cc:12: Check failed: size < (1 << 29U) RecordIO only accept record less than 2^29 bytes – Yash Jain Dec 03 '21 at 17:15
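The RecordIO check in that trace rejects any single record of 2^29 bytes (512 MiB) or larger, and the real-time InvokeEndpoint API itself caps the request body at roughly 6 MB, so one thing worth ruling out is an oversized payload before the call is even made. A minimal sketch (the 6 MB figure is the documented API limit; the helper name is mine):

```python
# Limits: the MXNet RecordIO reader rejects records >= 2^29 bytes, and the
# real-time InvokeEndpoint API caps request bodies at roughly 6 MB.
RECORDIO_LIMIT = 1 << 29              # 536_870_912 bytes
INVOKE_ENDPOINT_LIMIT = 6 * 1024 * 1024

def payload_ok(payload: bytes) -> bool:
    """True if the payload fits under both limits."""
    return len(payload) < min(INVOKE_ENDPOINT_LIMIT, RECORDIO_LIMIT)

# Check before calling invoke_endpoint, e.g.:
# if not payload_ok(imbytes):
#     raise ValueError(f'image too large for the endpoint: {len(imbytes)} bytes')
```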