
We have a ReactJS + NodeJS backend app running on Google App Engine (GCP), and we are experiencing random server downtime of a few minutes. We have info and error logs in almost every API handler, every major function, and every catch block. I have also added a global exception handler:

process.on('uncaughtException', (error) => {
  Logger.error('--- Exception -----', error);
});
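One possible reason for a silent crash: unhandled promise rejections do not trigger the 'uncaughtException' handler, so an async failure can take the process down with nothing in the logs. A minimal sketch of the companion hook (using console.error as a stand-in for the app's Logger):

```javascript
// Unhandled promise rejections need their own hook; they never reach
// the 'uncaughtException' handler. In the real app this would call
// Logger.error like the handler above.
process.on('unhandledRejection', (reason) => {
  console.error('--- Unhandled Rejection -----', reason);
});
```

(In newer Node versions an unhandled rejection terminates the process by default, which would match the "restart with no logged exception" symptom.)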

I also have an error-handler middleware in my Express.js app:

app.use(function(error, req, res, next) {
  if (error && error.stack && error.message) {
    //cluster.worker.disconnect();
    Logger.error('---- Global Exception Handler : error ---- ', error);
  }
  // Any error forwarded to this middleware is logged above and
  // returned to the client as a JSON failure response
  res.json({ status: 'FAIL', message: error.message });
});
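One caveat worth noting: Express's error middleware only sees errors that are thrown synchronously or explicitly passed to next(err). An error thrown inside an async route handler becomes an unhandled promise rejection and never reaches that middleware. A small wrapper (a common community pattern, not an Express built-in) forwards async errors:

```javascript
// Wrap async route handlers so a rejected promise is forwarded to the
// error middleware via next(err) instead of crashing silently.
const asyncHandler = (fn) => (req, res, next) =>
  Promise.resolve(fn(req, res, next)).catch(next);

// Hypothetical usage (loadCampaigns is an illustrative name):
// app.get('/api/campaigns', asyncHandler(async (req, res) => {
//   const data = await loadCampaigns(); // a rejection here hits next(err)
//   res.json(data);
// }));
```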

However, whenever the server restarts by itself, I don't see any exception in the logs; nothing appears even in the process.on('uncaughtException') handler.

One observation: GCP App Engine auto-scales when there is more traffic, but the app still goes down under higher traffic despite the auto scaling.

I want to know how to handle this and how to debug it to identify where exactly the issue is. Is the issue at the code level, and if so, why are none of the exceptions caught? Or is the issue with GCP?

Our production server also goes down many times, randomly.

I tried using longjohn and setting the NODE_DEBUG environment variable to net to get some hint, as suggested in another SO post.

Update 1: We ran a JMeter performance test to check whether load on the server is the main reason. As Rafael Lemos rightly pointed out in a comment, in our case "The instance exceeds the maximum memory for its configured instance_class" may be the reason behind the server going down. After a lot of trial and error, once we set the minimum number of instances to 2 we no longer observed the server going down.

However, while the JMeter tests all run fine on their own, when we run the JMeter test and manual testing at the same time we observe a Socket Hangup exception. The crash happens immediately as soon as we start using the front-end app, whereas the JMeter tests only hit our backend. Both the backend and the front end run in the same App Engine app.

Please find our observations below:

(screenshot of test observations)

Now the question is why the Socket Hangup occurs. When I explored this, I found a suggestion in another SO post to handle the socket hangup exception; I put the code below in app.js, but it's not working:

app.use((req, res, next) => {
  if (res.socket.destroyed) {
    console.log('----- socketIsDestroyed-- ', res.socket.destroyed);
    res.end();
  } else {
    console.log('----- socketNotDestroyed-- ', res.socket.destroyed);
    next();
  }
});

And in the Google App Engine syslogs I see the following error logged multiple times:

GET 503  /readiness_check failReason:"app lameducked"
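The "app lameducked" failReason indicates the readiness check hit an instance that was already being drained for shutdown. Since the app.yaml below defines no explicit health checks, one thing to try (as suggested in the comments) is configuring the flexible environment's split health checks explicitly; the values here are illustrative assumptions to tune, not recommendations:

```yaml
# Sketch: explicit split health checks for the flex environment
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2

readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
```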

Update 2: my app.yaml file. Is the issue with the resource limits?

runtime: nodejs
env: flex

automatic_scaling:
  min_num_instances: 2
  max_num_instances: 5

resources:
  cpu: .5
  memory_gb: 0.9
  disk_size_gb: 10

env_variables:
  <Many Environment Variables here>

Update 3: When we increased the resource limits, we observed the server going down less frequently. The updated resource limits are as follows. Is it all because of memory limit issues?

resources:
  cpu: 2
  memory_gb: 7.5
  disk_size_gb: 10
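A quick way to test the "instance exceeds its configured memory" theory is to log process memory periodically and compare the RSS figure against the memory_gb limit above. This is a diagnostic sketch, not a fix; the 60-second interval is an arbitrary choice:

```javascript
// Log resident set size and heap usage so memory growth shows up in
// the application logs before the instance is killed.
const logMemory = () => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n) => (n / 1024 / 1024).toFixed(1);
  console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB heapTotal=${mb(heapTotal)}MB`);
};

setInterval(logMemory, 60 * 1000).unref(); // unref: don't keep the process alive
```

A steadily climbing RSS approaching the configured limit would point at a leak (or undersized instances) rather than a GCP-side problem.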

The logs recorded during the socket hangup exception are as follows:

A 2021-02-25T05:17:16Z 1|dev_client  | Error: socket hang up
A 2021-02-25T05:17:16Z 1|dev_client  |     at createHangUpError (_http_client.js:323:15)
A 2021-02-25T05:17:16Z 1|dev_client  |     at Socket.socketCloseListener (_http_client.js:364:25)
A 2021-02-25T05:17:16Z 1|dev_client  |     at Socket.emit (events.js:203:15)
A 2021-02-25T05:17:16Z 1|dev_client  |     at TCP._handle.close (net.js:606:12)
A 2021-02-25T05:17:16Z 1|dev_client  | Error: socket hang up
A 2021-02-25T05:17:16Z 1|dev_client  |     at createHangUpError (_http_client.js:323:15)
A 2021-02-25T05:17:16Z 1|dev_client  |     at Socket.socketCloseListener (_http_client.js:364:25)
A 2021-02-25T05:17:16Z 1|dev_client  |     at Socket.emit (events.js:203:15)
A 2021-02-25T05:17:16Z 1|dev_client  |     at TCP._handle.close (net.js:606:12)
A 2021-02-25T05:17:16.691Z GET 499 0 B 6 s Firefox 85 /api/volunteercampaigns/6037246c4de6880024f01ff3/prominentVolunteers
A 2021-02-25T05:17:16.700Z GET 499 0 B 6 s Firefox 85 /api/volunteercampaigns/6037246c4de6880024f01ff3/prominentVolunteers
A 2021-02-25T05:17:17Z PM2              | App name:dev_client id:1 disconnected
A 2021-02-25T05:17:17Z PM2              | App [dev_client:1] exited with code [0] via signal [SIGINT]
A 2021-02-25T05:17:17Z PM2              | App [dev_client:1] starting in -cluster mode-
A 2021-02-25T05:17:17Z PM2              | App name:dev_client id:1 disconnected
A 2021-02-25T05:17:17Z PM2              | App [dev_client:1] exited with code [0] via signal [SIGINT]
A 2021-02-25T05:17:17Z PM2              | App [dev_client:1] starting in -cluster mode-
A 2021-02-25T05:17:18Z PM2              | App [dev_client:1] online
A 2021-02-25T05:17:18Z PM2              | App [dev_client:1] online
 
  • Is it possible that your app's instances are going through the following events that might trigger a temporary shutdown described in [this documentation](https://cloud.google.com/appengine/docs/standard/nodejs/how-instances-are-managed#shutdown)? – Ralemos Feb 24 '21 at 14:26
  • @RafaelLemos thanks for the link, i think in our case 3rd point, that is "The instance exceeds the maximum memory for its configured instance_class" may be the cause, however we ran jmeter performance test to check whether load on server is the reason, when we made min no of instances = 2 we are not observing down time, earlier it was 1 – Sadanand Feb 26 '21 at 04:10
  • @RafaelLemos I updated my question with more details – Sadanand Feb 26 '21 at 04:18
  • Can you share your app.yaml configuration and how do you handle a request (not all, juste an example to understand your processing flow)? – guillaume blaquiere Mar 04 '21 at 13:04
  • @guillaumeblaquiere I have updated app.yaml in my post – Sadanand Mar 05 '21 at 04:41
  • It's not possible to use less than 1 cpu in the app.yaml configuration. In addition, you don't have specified liveness and readiness probes. Does these probes exist? – guillaume blaquiere Mar 05 '21 at 08:10
  • @guillaumeblaquiere issue is same even if I remove cpu: .5 and what changes i have to make for liveness and readiness ? and why the issue socket hangup appears only when there is a load on server – Sadanand Mar 06 '21 at 09:03
  • What do you mean by load on server? When you start an instance? Or when the traffic increase? – guillaume blaquiere Mar 06 '21 at 12:38
  • @guillaumeblaquiere when the traffic increases, the way we are checking is we run jmeter tests with 100 users for 8 different scenarios, app never goes down, but along with jmeter test, if we start using manually , i mean automation + manual test app goes down suddenly. so one thing is clear, something wrong with front end side – Sadanand Mar 08 '21 at 03:03
  • It's like something went wrong with the GAE internal load balancing, like if it wasn't aware of the instance stop. That's why, adding a health check probe could help. Not sure but you should test this! – guillaume blaquiere Mar 08 '21 at 07:56
  • An instance cannot handle more than 100 connections: https://cloud.google.com/sql/docs/quotas#:~:text=Each%20App%20Engine%20instance%20running,concurrent%20connections%20to%20an%20instance. In addition the only thing that can is GKE or a VM . All other solutions are limited. App Engine is super expensive. Dont use it – Mitzi Mar 08 '21 at 16:26
  • Can you share some info about what the app is doing? if the logging system does not catch the error/exception it means it is at the OS level. This could happen for. example if your disk gets full, if your app downloads/creates local files (even logs) it could fill up the hdd, then the docker behind the app crashes and a new one re-generates - try increasing hdd size or moving to external logging system. Other OS level issues can happen for different reasons. But I remember from a similar issue in my app engine setup. – CloudBalancing Mar 08 '21 at 21:49
  • @CloudBalancing you point is valid, even we experienced it when we increased memory and hdd size crash did not come , atleast frequency of crash is reduced. however we are using papertrail as ext logging system, but does PM2 still creates lot of log files despite papertrail is configured ? – Sadanand Mar 09 '21 at 02:01
  • 1
    It does, but where the logs are kept is configurable - https://pm2.keymetrics.io/docs/usage/log-management/. AFAIK the default goes to stdout/stderr...`papertrail` also keeps some kind of copy on local storage - but I think this is negligible in size - what your app is doing? does it handle any kind of files? maybe even temporary?csv/images/videos/big json dumps/DB? – CloudBalancing Mar 09 '21 at 09:14
  • @Sadanand were you able to fix the issue? You can try opening a customer issue in [Google's Issue Tracker](https://issuetracker.google.com/issues/new?component=187191&template=0) to get their official input and instructions in the issue you are facing. – Ralemos Mar 15 '21 at 16:44
  • @RafaelLemos as of now we are able to manage it by increasing memory and cpu resource, however issue is not gone totally, but frequency of occurrence is reduced , ultimately the issue was mainly due to memory i think, not very sure about it – Sadanand Mar 16 '21 at 04:24
