Symptom
After running fine for some time, our backend stops giving responses for most of its endpoints and starts behaving like a black hole for them. Once in this state, it stays there unless we take action.
Update
We can reproduce this behaviour with a db dump we made while the backend was in the non-responding state.
Infrastructure Setup
We are running Play 2.5 on an EC2 instance in AWS, behind a load balancer, with a PostgreSQL database on RDS. We are using slick-pg as our database connector.
What we know
Here are a few things we have figured out so far.
About the HTTP requests
Our logs and debugging show that the requests pass the filters. We also see that, for authentication (we are using Silhouette for that), the application is able to perform database queries and receive the identity for the request. The controller's action just never gets called, though.
The backend does respond to HEAD requests. Further logging showed that controllers using injected services (we are using Google's Guice for that) seem to be the ones whose methods are no longer called. Controllers without injected services seem to work fine.
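For illustration, a minimal controller without injected services might look like this (a hypothetical example, not our actual code):

package controllers

import play.api.mvc.{ Action, Controller }

// No @Inject()'ed services at all; controllers of this shape keep responding
// even while the ones with injected services go silent.
class PingController extends Controller {

  def ping = Action { Ok("pong") }
}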
About the EC2 instance
Unfortunately, we are not able to get much information out of the instance itself. We are using Boxfuse, which gives us an immutable and therefore un-SSH-able infrastructure. We are about to change this to a Docker-based deployment and might be able to provide more information soon. Nevertheless, we have New Relic set up to monitor our servers, and we cannot find anything suspicious there: memory and CPU usage look fine.
Still, this setup gives us a new EC2 instance on every deployment anyway, and even after a redeployment the issue persists most of the time. Eventually a redeployment does resolve it.
Even weirder, we can run the backend locally, connected to the database on AWS, and everything works just fine there.
So it is hard for us to say where the problem lies. The db seems not to work with any EC2 instance (until it eventually works with a new one), yet it does work with our local machines.
About the Database
The db is the only stateful entity in this setup, so we think the issue should somehow be related to it.
As we have a production and a staging environment, we are able to dump the production db into staging when the latter stops working, and we found that this indeed resolves the issue immediately. Initially we were not able to take a snapshot of a somehow corrupt database and dump it into the staging environment to see whether that would break it immediately. We now have a snapshot of the db from a time when the backend was no longer responding, and when we dump it into our staging environment, the backend stops responding immediately.
The number of connections to the DB is around 20 according to the AWS console, which is normal.
TL;DR
- Our backend eventually starts behaving like a black hole for some of its endpoints
- The requests never reach the controller actions
- A new EC2 instance might resolve this, but not necessarily
- Locally, with the very same db, everything works fine
- Dumping a working db into the broken environment resolves the issue
- CPU and memory usage of the EC2 instances, as well as the number of connections to the DB, look totally fine
- We can reproduce the behaviour with a db dump we made when the backend was no longer responding (see Update 2)
- With new Slick thread pool settings, we got ThreadPoolExecutor exceptions from Slick after a reboot of the db followed by a reboot of our EC2 instance (see Update 3)
Update 1
Responding to marcospereira:
Take, for instance, this ApplicationController.scala:
package controllers
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import akka.actor.ActorRef
import com.google.inject.Inject
import com.google.inject.name.Named
import com.mohiva.play.silhouette.api.Silhouette
import play.api.i18n.{ I18nSupport, MessagesApi }
import play.api.mvc.Action
import play.api.mvc.Controller
import jobs.jobproviders.BatchJobChecker.UpdateBasedOnResourceAvailability
import utils.auth.JobProviderEnv
/**
 * The basic application controller.
 *
 * @param messagesApi The Play messages API.
 * @param webJarAssets The webjar assets implementation.
 */
class ApplicationController @Inject() (
  val messagesApi: MessagesApi,
  silhouette: Silhouette[JobProviderEnv],
  implicit val webJarAssets: WebJarAssets,
  @Named("batch-job-checker") batchJobChecker: ActorRef
) extends Controller with I18nSupport {

  def index = Action.async { implicit request =>
    Future.successful(Ok)
  }

  def admin = Action.async { implicit request =>
    Future.successful(Ok(views.html.admin.index.render))
  }

  def taskChecker = silhouette.SecuredAction.async {
    batchJobChecker ! UpdateBasedOnResourceAvailability
    Future.successful(Ok)
  }
}
The index and admin actions are working just fine. The taskChecker action shows the weird behaviour, though.
Update 2
We are now able to reproduce this issue! We found that we had made a db dump the last time our backend stopped responding. When we dump this into our staging database, the backend stops responding immediately.
We started logging the number of threads in one of our filters using Thread.getAllStackTraces.keySet.size and found that there are between 50 and 60 threads running.
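For illustration, a stripped-down sketch of such a filter might look like this (the class name ThreadCountFilter is illustrative; our real filter does more):

import javax.inject.Inject

import scala.concurrent.Future

import akka.stream.Materializer
import play.api.Logger
import play.api.mvc.{ Filter, RequestHeader, Result }

class ThreadCountFilter @Inject() (implicit val mat: Materializer) extends Filter {

  def apply(next: RequestHeader => Future[Result])(request: RequestHeader): Future[Result] = {
    // Log the total number of live JVM threads before passing the request on.
    Logger.info(s"JVM thread count: ${Thread.getAllStackTraces.keySet.size}")
    next(request)
  }
}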
Update 3
As @AxelFontaine suggested, we enabled Multi-AZ deployment for the database and rebooted it with failover. Before, during and after the reboot the backend was not responding.
After the reboot we noticed that the number of connections to the db stayed at 0. We also no longer got any logs for authentication (before, we did, and the authentication step could even make db requests and get responses).
After rebooting the EC2 instance, we are now getting
play.api.UnexpectedException: Unexpected exception[RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@76d6ac53 rejected from java.util.concurrent.ThreadPoolExecutor@6ea1d0ce[Running, pool size = 4, active threads = 4, queued tasks = 5, completed tasks = 157]]
for our requests as well as for our background jobs that need to access the db (we did not get these exceptions before). Our Slick settings now include
numThreads = 4
queueSize = 5
maxConnections = 10
connectionTimeout = 5000
validationTimeout = 5000
as suggested here
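For context, this is consistent with how a bounded thread pool behaves once it is saturated. A minimal standalone sketch with the same shape as the settings above (plain java.util.concurrent, not Slick's internal executor):

import java.util.concurrent.{ ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit }

object RejectedTaskDemo extends App {

  // Pool shaped like the settings above: 4 threads, bounded queue of 5.
  val pool = new ThreadPoolExecutor(
    4, 4, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue[Runnable](5))

  // Stands in for a db task that never completes, e.g. one stuck on a dead connection.
  val blockedTask = new Runnable {
    def run(): Unit = Thread.sleep(60000)
  }

  // 4 tasks start running and 5 more fill the queue ...
  (1 to 9).foreach(_ => pool.execute(blockedTask))

  // ... so the next submission is rejected right away (default AbortPolicy),
  // matching the RejectedExecutionException that Play wraps above.
  pool.execute(blockedTask)
}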
Update 4
After we got the exceptions described in Update 3, the backend is now running fine again. We did not do anything for that; this is the first time the backend has recovered from this state without our involvement.