
I have had a SageMaker instance running for a while now. I didn't change anything in between, but now I can't see new logs in CloudWatch anymore. The old logs are still there, but no new ones have appeared for two days.

The SageMaker instance is still running. It's just not logging anymore. And as the code didn't change and I don't have anything time-dependent in there, I'm pretty sure I hit a limit. But I don't know which one:

  • The log group has only one log stream.
  • The single log stream has a size of 175 MB.

I found CloudWatch Logs Limits and CloudWatch Events Limits, but neither helped me.

What could be the problem? How can I investigate it?

According to the AWS docs, this should not happen. General AWS support did not help.
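A minimal boto3 sketch of one way to investigate (the log group name is a placeholder for whatever SageMaker created in your account; note that `storedBytes` on streams is deprecated and may read 0):

```python
# Hypothetical check: list the streams of a log group and print when each
# one last received an event. The log group name below is a placeholder.
import datetime

import boto3

logs = boto3.client("logs")
paginator = logs.get_paginator("describe_log_streams")
for page in paginator.paginate(logGroupName="/aws/sagemaker/TrainingJobs"):
    for stream in page["logStreams"]:
        ts = stream.get("lastEventTimestamp")  # milliseconds since epoch
        last = datetime.datetime.fromtimestamp(ts / 1000.0) if ts else None
        print(stream["logStreamName"], stream.get("storedBytes"), last)
```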

Martin Thoma
  • I have not worked with SageMaker, but I can still give you some pointers that should help debug this. I assume you can get onto the underlying EC2 machine. See this before starting: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/QuickStartEC2Instance.html. First I would run `sudo systemctl status awslogsd` to make sure it's running. Next I would make sure that the policy to `"arn:aws:logs:*:*:*"` is still active. Next I would run `journalctl -u awslogsd` to see if I find any issues in the logs of `awslogsd`. If nothing turns up, I would run `journalctl -f` and watch the live logs – Tarun Lalwani Apr 23 '18 at 14:09
  • I don't think I can log in to SageMaker with a shell... or at least I don't know how. – Martin Thoma Apr 23 '18 at 14:51
  • There is an option for S3 logs as well, I believe? Also, can you check whether there is a policy issue? – Tarun Lalwani Apr 23 '18 at 18:30
  • I don't know if I can see if there is a policy issue. The point is that it was running for quite a while. The change was surprising to me and I don't think anything was changed on my side. – Martin Thoma Apr 23 '18 at 19:18
  • I got logging working by changing the role ARN from one role with CloudWatch access to another role with CloudWatch access. All of a sudden, the logs lit up. – UsamaAmjad Apr 25 '18 at 00:52
  • Cliché, but did you try restarting it? Our DDB streams routinely get stuck after a stack update for the receiving Lambdas. Whether or not you hit a limit should be easily detectable by calling up AWS support. – Kashyap Apr 25 '18 at 14:32
  • How do I restart SageMaker? How do I call AWS support? – Martin Thoma Apr 25 '18 at 14:45
  • (Technical support is not in my account plan) – Martin Thoma Apr 25 '18 at 14:45
  • I suspect you have some sort of permissions problem. As a starting point, I would enable CloudTrail logging (and send the CloudTrail logs to CloudWatch Logs for ease of searching), then do something to cause SageMaker to generate some log output, and check what shows up in the CloudTrail logs. This will help identify whether you're running into a permissions problem. If you see successful calls to write to CloudWatch Logs from SageMaker, then the problem is with CloudWatch Logs. (A sketch of this check follows the comment thread.) – Alex Hague Apr 28 '18 at 22:58
  • Do you have any visibility into what the log files actually look like? Are they rotated, and at what frequency? The CloudWatch Logs agent will ignore a rotated file if the first line (by default) is the same as in the previous file. Can you see what the log files look like and what the CloudWatch Logs configuration is? – Dejan Peretin Apr 29 '18 at 09:19
  • @Tartaglia I don't think SageMaker gives me any insight into that. – Martin Thoma Apr 29 '18 at 10:21
  • "CloudWatch logs agent will ignore a rotated file if the first line (by default) is the same as in the previous file. " - interesting. I have a lot of duplicate lines... I will investigate that. Thank you! – Martin Thoma Apr 29 '18 at 10:22
  • I believe more information needs to be collected to investigate your issue. You can either share your resource ARN here (I believe it is a notebook instance, right?), or post your issue on the AWS forums as @leopd suggested; then we can follow up through private message over there. Thanks! - an AWS employee – Fan LI Oct 30 '18 at 17:31
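A minimal boto3 sketch of the CloudTrail check Alex Hague suggests above (assumptions: CloudTrail is enabled in the region; `PutLogEvents` itself is not recorded by CloudTrail, but failed management calls such as `CreateLogStream` are):

```python
# Hypothetical check: scan recent CloudWatch Logs management events in
# CloudTrail and print any that failed, e.g. with an AccessDenied error.
import json

import boto3

cloudtrail = boto3.client("cloudtrail")
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "logs.amazonaws.com"}
    ],
    MaxResults=50,
)
for event in response["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    if detail.get("errorCode"):
        print(event["EventName"], detail["errorCode"], detail.get("errorMessage"))
```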

2 Answers

1

First, it doesn't sound like you're doing anything wrong. Logs should just show up in CloudWatch without you having to do anything, without size or time limits. If they start at all, then we know permissions were set up properly -- unless you modified IAM in the middle of the run. If the logs stop mid job, then either the actual job stopped outputting to stdout/stderr for some reason or this is an operational glitch with the service's log processing. Contacting AWS support (here, in the AWS forums, or through tech support) is the right way to deal with this -- giving somebody in AWS the account id and job name will enable them to look into exactly what happened.

Also, sorry this has gone unanswered for so long. Judging by the activity here, it seemed like a lot of people might have hit this problem. But I'm also guessing & hoping that the problem was a temporary internal service glitch that has been resolved. If anybody is still seeing this problem (after October 2018), please leave a comment so we know it still needs attention. Or better yet open a new question (not ideal from an SO perspective, but that's more likely to get somebody's attention at AWS).

Thanks for using Amazon SageMaker, and thanks for the feedback!

-An AWS employee

Leopd
  • I asked AWS support, but they were not helpful. They sent me a couple of links which basically said that AWS takes care of the logging. After I mentioned that this is likely a bug on AWS's side, they only replied that they are not technical support (which I didn't book). Later, I think I found the problem: I had a lot of identical log messages. Somehow this seems to have caused issues (although I could not see that I had hit any limit). Adding a timestamp to each message and making the logging less verbose solved it for me (for now; not sure if this will re-occur). A sketch of this fix follows the thread. – Martin Thoma Oct 26 '18 at 17:12
  • Does it still repro? And sorry you couldn't get the help you needed at the time -- AWS forums are sometimes a better way to get attention of technical folks, but we're working on watching SO more closely. – Leopd Oct 26 '18 at 17:30
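A sketch of the fix described in the comment above, assuming the job logs through Python's standard logging module (the format string is only an example): a timestamp per record makes otherwise identical lines unique, and raising the level reduces verbosity.

```python
import logging

# A timestamp in every record makes consecutive messages unique; INFO
# instead of DEBUG makes the output less verbose.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
logger.info("epoch finished")
```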
0

I encountered this problem multiple times. It's possible that a new LogStream wasn't created after an endpoint update (which can be triggered by you, or by AWS restarting/updating the underlying instances). You should see a log stream for every instance that runs, or used to run, on your endpoint.

Unfortunately, the only way to mitigate it for me was to update the endpoint (apply an identical EndpointConfiguration that uses the same model, for example), basically triggering recreation of the instances and their log streams.
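A minimal boto3 sketch of this workaround (all names are placeholders; it assumes the current endpoint config can be cloned as-is):

```python
# Hypothetical workaround: clone the endpoint's current config under a new
# name and point the endpoint at it, forcing SageMaker to replace the
# instances and open fresh log streams.
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "my-endpoint"  # placeholder

current = sm.describe_endpoint(EndpointName=endpoint_name)
config = sm.describe_endpoint_config(
    EndpointConfigName=current["EndpointConfigName"]
)

new_name = config["EndpointConfigName"] + "-recreate"
sm.create_endpoint_config(
    EndpointConfigName=new_name,
    ProductionVariants=config["ProductionVariants"],
)
sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_name)
```

As far as I know, SageMaker rejects an `UpdateEndpoint` call that reuses the currently deployed config name, which is why the sketch clones the config under a new name first.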