
I have AWS API Gateway set up for a public endpoint with no auth. It exposes a websocket that triggers a Lambda.

I was creating connections with Python's websocket-client lib at https://pypi.org/project/websocket_client/.

I noticed that connections would fail ~10% of the time, and the failure rate got worse as I increased load. I can't find anything that would be throttling me, seeing as my general API Gateway settings say "Your current account level throttling rate is 10000 requests per second with a burst of 5000 requests." Besides, just 2-3 requests per second would trigger the issue fairly often.

Meanwhile the failure response would be like {u'message': u'Forbidden', u'connectionId': u'Z2Jp-dR5vHcCJkg=', u'requestId': u'Z2JqAEJRvHcFzvg='}
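As a rough sketch, a load-test client could flag frames with this shape separately from real application responses. The helper name `is_gateway_error` is made up for illustration, and the check is only a heuristic based on the three keys seen in the failure above:

```python
def is_gateway_error(frame):
    """Heuristic: True if a decoded frame has the three keys seen in the
    failure response (message, connectionId, requestId)."""
    return (
        isinstance(frame, dict)
        and "message" in frame
        and "connectionId" in frame
        and "requestId" in frame
    )


# The failure response quoted in the question.
forbidden = {
    "message": "Forbidden",
    "connectionId": "Z2Jp-dR5vHcCJkg=",
    "requestId": "Z2JqAEJRvHcFzvg=",
}
print(is_gateway_error(forbidden))      # True
print(is_gateway_error({"mykey": 1}))   # False
```

An application response that happens to carry the same three keys would also match, so this is only a first filter.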

I went into CloudWatch Logs Insights and searched for the connection ID and request ID. A search of the API Gateway log group found no results for either ID. Yet a search of the logs for my Lambda that fires on websocket connect did find an entry with that connection ID. The log showed everything running as expected on our side. The Lambda simply runs a MySQL query.

Why would I get a response of forbidden, despite the lambda working as expected?

The existing question, "getting message: forbidden reply from AWS API gateway", seems to address the case where a private endpoint ALWAYS returns Forbidden. Nothing there lined up with my use case.

UPDATE

I think this may be related to locust.io, or Python, which I'm using to connect every second. I installed https://www.npmjs.com/package/wscat on my machine and am connecting and closing as fast as possible, repeatedly. With wscat I am not getting a Forbidden message. It's just extra confusing, since I'm not sure how the way I connect could randomly produce a Forbidden message some of the time.

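import json
import logging
import time
from uuid import uuid4

import gevent
import websocket
# Pre-1.0 Locust API (Locust class, events.request_success/request_failure)
from locust import Locust, TaskSet, events, task

logger = logging.getLogger(__name__)


class SocketClient(object):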
    def __init__(self, host):
        self.host = host
        self.session_id = uuid4().hex

    def connect(self):
        self.ws = websocket.WebSocket()
        self.ws.settimeout(10)
        self.ws.connect(self.host)

        events.quitting += self.on_close

        data = self.attach_session({})
        return data

    def attach_session(self, payload):
        message_id = uuid4().hex
        start_time = time.time()
        e = None
        data = None  # stays None if the send/receive below fails
        try:
            print("Sending payload {}".format(payload))
            data = self.send_with_response(payload)
            assert data['mykey']

        except AssertionError as exp:
            e = exp
        except Exception as exp:
            e = exp
            self.ws.close()
            self.connect()
        elapsed = int((time.time() - start_time) * 1000)
        if e:
            events.request_failure.fire(request_type='sockjs', name='send',
                                        response_time=elapsed, exception=e)
        else:
            events.request_success.fire(request_type='sockjs', name='send',
                                        response_time=elapsed,
                                        response_length=0)
        return data

    def send_with_response(self, payload):
        json_data = json.dumps(payload)

        g = gevent.spawn(self.ws.send, json_data)
        g.get(block=True, timeout=2)
        g = gevent.spawn(self.ws.recv)
        result = g.get(block=True, timeout=10)

        json_data = json.loads(result)
        return json_data

    def on_close(self):
        self.ws.close()

class ActionsTaskSet(TaskSet):
    @task
    def streams(self):
        response = self.client.connect()
        logger.info("Connect Response: {}".format(response))

class WSUser(Locust):
    task_set = ActionsTaskSet
    min_wait = 1000
    max_wait = 3000

    def __init__(self, *args, **kwargs):
        super(WSUser, self).__init__(*args, **kwargs)
        self.client = SocketClient('wss://mydomain.amazonaws.com/endpoint')


Update 2

I have enabled access logs, the one type of log that wasn't there before. I can now see that my lambdas are always getting a 200 with no issue. The 403 is coming from some MESSAGE eventType that doesn't hit an actual routeKey. Not sure where it comes from, but pretty sure finding that answer will solve this.
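To make this kind of check repeatable, here is a minimal sketch of filtering access log entries for 403s on MESSAGE events. It assumes the access log format is JSON built from the `$context` variables shown; the function name `unrouted_forbidden` and the sample lines are made up for illustration and are not real log output:

```python
import json

# Assumed access log format (an assumption, not the format used in the question).
ACCESS_LOG_FORMAT = json.dumps({
    "requestId": "$context.requestId",
    "connectionId": "$context.connectionId",
    "eventType": "$context.eventType",
    "routeKey": "$context.routeKey",
    "status": "$context.status",
})


def unrouted_forbidden(log_lines):
    """Return parsed entries that are 403 responses on MESSAGE events."""
    hits = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("eventType") == "MESSAGE" and str(entry.get("status")) == "403":
            hits.append(entry)
    return hits


# Hypothetical sample lines, invented for this example.
sample = [
    '{"eventType": "CONNECT", "routeKey": "$connect", "status": "200"}',
    '{"eventType": "MESSAGE", "routeKey": "-", "status": "403"}',
    "not json",
]
print(unrouted_forbidden(sample))
```

Only the second sample line is returned: the connect succeeded, while the message that matched no route was rejected.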

I was also able to confirm there are no ENI issues.


Dave Stein

2 Answers

4

You might be running into some VPC-related limits. See https://winterwindsoftware.com/scaling-lambdas-inside-vpc/. It sounds like you might be running out of ENIs. You could try moving the function to a different VPC. How long does each invocation of the Lambda run for? And what language is your Lambda written in?

complex
  • 116
  • 4
  • My timeout is set to 6 seconds. The average duration is 70ms. Since my test is running at 1-3 a second and running into this, could I possibly be running out of ENIs? – Dave Stein May 20 '19 at 20:04
  • My current limit is 350 for Network Interfaces and 5 for VPC security groups per elastic network interface. My unreserved account concurrency for a lambda is 1000 – Dave Stein May 20 '19 at 20:09
  • Also, I'm thinking I get a `Forbidden` from gateway during a spinup, but that doesn't explain the many `Forbidden`s I get in less than 6 seconds, after getting success, while running 2 a second. – Dave Stein May 20 '19 at 20:17
  • I see this got 4 upvotes, but I have not hit any limits. Certainly not ENIs, as confirmed by AWS support. My last update, showing the `routeKey` issue, should confirm this isn't the issue at hand. – Dave Stein May 28 '19 at 14:24
0

The payload in my example is empty. The API is configured to use $request.body.action to determine the routeKey, so a message with an empty body carries no action and matches no route. Connecting only goes through the $connect route, which is why that part worked.

Adding a proper action to my message body made the 403s go away. This is the solution. I was essentially getting 200 responses from the act of connecting and disconnecting, but was getting the 403 whenever my message with an empty payload went through.
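The behaviour can be pictured with a small local model. This is only a sketch of the routing rule as I understand it, not API Gateway's actual implementation: `select_route` and the route name `getStreams` are hypothetical, and the fallback to a `$default` route assumes one has been defined on the API.

```python
import json


def select_route(message, routes, selection_key="action"):
    """Pick a route key for a message, or None if nothing matches.

    Models a route selection expression of $request.body.<selection_key>.
    """
    try:
        body = json.loads(message)
    except ValueError:
        body = None  # non-JSON messages carry no action
    key = body.get(selection_key) if isinstance(body, dict) else None
    if isinstance(key, str) and key in routes:
        return key
    if "$default" in routes:
        return "$default"
    return None  # no matching route: the case that showed up as a 403


routes = {"getStreams"}  # hypothetical custom route
print(select_route(json.dumps({}), routes))                        # None
print(select_route(json.dumps({"action": "getStreams"}), routes))  # getStreams
```

In the client from the question, the equivalent change would be passing a payload such as `{"action": "getStreams"}` to `attach_session` instead of `{}`, with the action name replaced by a route that actually exists on the API.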

Dave Stein