
Symptom

After some time of running just fine, our backend will stop giving responses for most of its endpoints. It just starts behaving like a black hole for those. Once in this state, it stays there if we don't take any action.

Update

We can reproduce this behaviour with a DB dump we made when the backend was in the non-responding state.

Infrastructure Setup

We are running Play 2.5 in AWS on an EC2 instance behind a load balancer, with a PostgreSQL database on RDS. We are using slick-pg as our database connector.

What we know

Here are a few things we have figured out so far.

About the HTTP requests

Our logs and debugging show us that the requests are passing the filters. Also, we see that for authentication (we are using Silhouette for that) the application is able to perform database queries to retrieve the identity for that request. The controller's action will just never be called, though.

The backend is responding to HEAD requests. Further logging showed us that it seems controllers using injected services (we are using Google's Guice for that) are the ones whose methods are not being called anymore. Controllers without injected services seem to work fine.

About the EC2 instance

Unfortunately, we are not able to get much information from that one. We are using Boxfuse, which gives us an immutable infrastructure that we cannot SSH into. We are about to change this to a Docker-based deployment and might be able to provide more information soon. Nevertheless, we have New Relic set up to monitor our servers, and we cannot find anything suspicious there. The memory and CPU usage look fine.

Still, this setup gives us a new EC2 instance on every deployment anyway, and even after a redeployment the issue usually persists. Eventually, though, a redeployment can resolve it.

Even weirder is the fact that we can run the backend locally, connected to the database on AWS, and everything works just fine there.

So it is hard for us to say where the problem lies. It seems the DB does not work with any EC2 instance (until it eventually works with a new one), yet it works with our local machines.

About the Database

The db is the only stateful entity in this setup, so we think the issue somehow should be related to it.

As we have a production and a staging environment, we are able to dump the production DB into staging when the latter is not working anymore. We found that this indeed resolves the issue immediately. Unfortunately, at first we were not able to take a snapshot of a somehow corrupt database to dump it into the staging environment and see whether that would break it immediately. We do now have a snapshot of the DB from when the backend was not responding anymore; when we dump this into our staging environment, the backend stops responding immediately.

The number of connections to the DB is around 20 according to the AWS console, which is normal.

TL;DR

  • Our backend eventually starts behaving like a black hole for some of its endpoints
  • The requests are not reaching the controller actions
  • A new instance in EC2 might resolve this, but not necessarily
  • Locally, with the very same DB, everything works fine
  • Dumping a working DB into it resolves the issue
  • CPU and memory usage of the EC2 instances, as well as the number of connections to the DB, look totally fine
  • We can reproduce the behaviour with a DB dump we made when the backend was no longer responding (see Update 2)
  • With the new Slick thread pool settings, we got ThreadPoolExecutor exceptions from Slick after a reboot of the DB followed by a reboot of our EC2 instance (see Update 3)

Update 1

Responding to marcospereira:

Take, for instance, this ApplicationController.scala:

package controllers

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import akka.actor.ActorRef
import com.google.inject.Inject
import com.google.inject.name.Named
import com.mohiva.play.silhouette.api.Silhouette
import play.api.i18n.{ I18nSupport, MessagesApi }
import play.api.mvc.Action
import play.api.mvc.Controller

import jobs.jobproviders.BatchJobChecker.UpdateBasedOnResourceAvailability
import utils.auth.JobProviderEnv

/**
 * The basic application controller.
 *
 * @param messagesApi The Play messages API.
 * @param silhouette The Silhouette stack for the job provider environment.
 * @param webJarAssets The webjar assets implementation.
 * @param batchJobChecker The actor checking batch jobs based on resource availability.
 */
class ApplicationController @Inject() (
  val messagesApi: MessagesApi,
  silhouette: Silhouette[JobProviderEnv],
  implicit val webJarAssets: WebJarAssets,
  @Named("batch-job-checker") batchJobChecker: ActorRef
)
    extends Controller with I18nSupport {


  def index = Action.async { implicit request =>
    Future.successful(Ok)
  }

  def admin = Action.async { implicit request =>
    Future.successful(Ok(views.html.admin.index.render))
  }


  def taskChecker = silhouette.SecuredAction.async {
    batchJobChecker ! UpdateBasedOnResourceAvailability
    Future.successful(Ok)
  }

}

The index and admin actions work just fine. The taskChecker action shows the weird behaviour, though.

Update 2

We are able to reproduce this issue now! We found that we had made a DB dump the last time our backend stopped responding. When we dump this into our staging database, the backend stops responding immediately.

We started logging the number of threads now in one of our filters using Thread.getAllStackTraces.keySet.size and found that there are between 50 and 60 threads running.
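
For reference, this is roughly what that filter looks like: a minimal sketch assuming Play 2.5's Filter API (the class name is ours).

import javax.inject.Inject

import scala.concurrent.Future

import akka.stream.Materializer
import play.api.Logger
import play.api.mvc.{ Filter, RequestHeader, Result }

// Logs the JVM-wide thread count for every request passing through the filter chain.
class ThreadCountFilter @Inject()(implicit val mat: Materializer) extends Filter {

  override def apply(next: RequestHeader => Future[Result])(rh: RequestHeader): Future[Result] = {
    Logger.info(s"threads=${Thread.getAllStackTraces.keySet.size} for ${rh.method} ${rh.uri}")
    next(rh)
  }
}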

Update 3

As @AxelFontaine suggested, we enabled Multi-AZ deployment for the database and rebooted it with failover. Before, during, and after the reboot the backend was not responding.

After the reboot we noticed that the number of connections to the DB stayed at 0. Also, we did not get any authentication logs anymore (before the reboot we did; the authentication step could even make DB requests and got responses).

After rebooting the EC2 instance as well, we are now getting

play.api.UnexpectedException: Unexpected exception[RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@76d6ac53 rejected from java.util.concurrent.ThreadPoolExecutor@6ea1d0ce[Running, pool size = 4, active threads = 4, queued tasks = 5, completed tasks = 157]]

(we did not get those before) for our requests as well as for our background jobs that need to access the DB. This matches the new pool limits: with all four threads busy and five tasks queued, Slick rejects any further task instead of queueing it indefinitely. Our Slick settings now include

numThreads = 4
queueSize = 5
maxConnections = 10
connectionTimeout = 5000
validationTimeout = 5000

as suggested here
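
For context, with the standard play-slick setup these keys sit under the default database block in application.conf. The exact nesting below is an assumption on our part, based on the slick.dbs.default.db.maxConnections setting mentioned in a comment further down:

slick.dbs.default.db {
  numThreads = 4
  queueSize = 5
  maxConnections = 10
  connectionTimeout = 5000
  validationTimeout = 5000
}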

Update 4

After we got the exceptions described in Update 3, the backend is now running fine again. We didn't do anything to make that happen; this was the first time the backend recovered from this state without our intervention.

stubbi
  • What are the differences between the controllers that are executing and the ones that aren't? Is there a chance that a filter is preventing the controllers from being reached? – marcospereira Sep 04 '17 at 13:36
  • And some more questions: is this happening under heavy load? Which version of Play exactly are you using? Are there any helpful logs? – marcospereira Sep 04 '17 at 13:38
  • The only difference that we can spot between the controller actions that work and the ones that don't is that the actions that don't work use some service that is dependency injected. The ones that work either just reply immediately with a status code or just render a template. – stubbi Sep 04 '17 at 14:21
  • It's not happening under heavy load. We're using Play 2.5.15. We can't see anything helpful in the logs other than the request arriving but no response. – stubbi Sep 04 '17 at 14:31
  • Can you add sample code for controllers with and without injected services? – marcospereira Sep 04 '17 at 14:40
  • I updated the question including one controller with 3 functions of which 2 are working fine and one showing the weird behaviour. – stubbi Sep 04 '17 at 14:48
  • @stubbi Have you tried a multi-AZ failover of the RDS DB to see if it is the one DB node somehow becoming unresponsive when handing out connections? – Axel Fontaine Sep 05 '17 at 06:11
  • @AxelFontaine we enabled Multi-AZ for our staging environment and rebooted the db with failover. We assume that does what you imply here? The backend did not work before, during, or after that phase. What we noticed though is that after the reboot the number of connections stayed at 0. The backend was still not responding and not logging any activity anymore for authentication requests to the db (which it did before). After restarting the EC2 instance, now with the new slick threadpool settings, we receive a slick exception very fast (see Update 3 of my question) – stubbi Sep 05 '17 at 20:34
  • btw @marcospereira could you read anything from the Controller samples I provided (see [my comment](https://stackoverflow.com/questions/46036979/play-framework-2-5-requests-timing-out-randomly#comment79039333_46036979))? – stubbi Sep 05 '17 at 20:45

2 Answers


It sounds like a thread management issue at first glance. Slick will provide its own execution context for database operations if you are using Slick 3.1, but you do want to manage the queue size so that it maps out to roughly the same size as the database:

myapp = {
  database = {
    driver = org.h2.Driver
    url = "jdbc:h2:./test"
    user = "sa"
    password = ""

    // The number of threads determines how many things you can *run* in parallel
    // the number of connections determines how many things you can *keep in memory* at the same time
    // on the database server.
    // numThreads = (core_count (hyperthreading included))
    numThreads = 4

    // queueSize = ((core_count * 2) + effective_spindle_count)
    // on a MBP 13, this is 2 cores * 2 (hyperthreading not included) + 1 hard disk
    queueSize = 5

    // https://groups.google.com/forum/#!topic/scalaquery/Ob0R28o45eM
    // make larger than numThreads + queueSize
    maxConnections = 10

    connectionTimeout = 5000
    validationTimeout = 5000
  }
}

Also, you may want to use a custom ActionBuilder: inject a Futures component and add

import play.api.libs.concurrent.Futures._

Once you do that, you can add future.withTimeout(500 milliseconds) to time out the future so that an error response will come back. There's an example of a custom ActionBuilder in the Play example:

https://github.com/playframework/play-scala-rest-api-example/blob/2.5.x/app/v1/post/PostAction.scala

class PostAction @Inject()(messagesApi: MessagesApi)(
    implicit ec: ExecutionContext)
    extends ActionBuilder[PostRequest]
    with HttpVerbs {

  type PostRequestBlock[A] = PostRequest[A] => Future[Result]

  private val logger = org.slf4j.LoggerFactory.getLogger(this.getClass)

  override def invokeBlock[A](request: Request[A],
                              block: PostRequestBlock[A]): Future[Result] = {
    if (logger.isTraceEnabled()) {
      logger.trace(s"invokeBlock: request = $request")
    }

    val messages = messagesApi.preferred(request)
    val future = block(new PostRequest(request, messages))

    future.map { result =>
      request.method match {
        case GET | HEAD =>
          result.withHeaders("Cache-Control" -> "max-age=100")
        case other =>
          result
      }
    }
  }
}

So you'd add the timeout, metrics (or a circuit breaker if the database is down) here.
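
For what it's worth, here is a minimal sketch of the filter-based variant of this idea (the approach the comments below converge on), using plain akka.pattern.after rather than the Futures helper; the class name and the 503 response body are our own choices.

import javax.inject.Inject

import scala.concurrent.duration._
import scala.concurrent.{ ExecutionContext, Future, TimeoutException }

import akka.actor.ActorSystem
import akka.pattern.after
import akka.stream.Materializer
import play.api.mvc.{ Filter, RequestHeader, Result, Results }

// Fails any request that takes longer than 500 ms with a 503 instead of letting it hang forever.
class TimeoutFilter @Inject()(actorSystem: ActorSystem)(
    implicit val mat: Materializer, ec: ExecutionContext) extends Filter {

  private val requestTimeout = 500.milliseconds

  override def apply(next: RequestHeader => Future[Result])(rh: RequestHeader): Future[Result] = {
    val work = next(rh)
    // akka.pattern.after completes this future with a failure once the timeout elapses.
    val timeout = after(requestTimeout, actorSystem.scheduler)(
      Future.failed(new TimeoutException(s"${rh.method} ${rh.uri} timed out")))

    Future.firstCompletedOf(Seq(work, timeout)).recover {
      case _: TimeoutException => Results.ServiceUnavailable("Request timed out")
    }
  }
}

A custom ActionBuilder as above gives the same control per action; the filter simply applies it globally.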

Will Sargent
  • Thanks for your answer! I will still need some time to understand how it all relates. For the first part regarding the slick settings: all we set there is `slick.dbs.default.db.maxConnections = 800`, as we found that the AWS db.t2.large defaults max_connections to 856, and we changed it from the default to this value when we were facing this issue before. For the remaining settings we use the defaults. As for the other part, I am not sure I understand how it would help us. Maybe you know how we could reproduce this behaviour, at least if it is related to your suggestions? – stubbi Sep 04 '17 at 15:45
  • @stubbi Don't forget that if your DB instance supports at most 856 connections, as soon as you have a second node you will have two nodes potentially requesting up to 1600 connections (out of a total of 856). This is certainly something to watch out for during blue/green deployments. – Axel Fontaine Sep 05 '17 at 06:07
  • @AxelFontaine that's a good hint! We will reduce the number to at least half of it. Still, the behaviour does not start during the deployments but just randomly to us. – stubbi Sep 05 '17 at 10:14
  • @WillSargent We figured out now that we have around 50-60 threads running and also that we can reproduce the behaviour with a specific db dump (see `Update 2` of my question). We are trying now to change the slick threadpool settings as you suggested. The next step afterwards would be to use a custom action builder, where we will need to extend [silhouette's SecuredActionBuilder](https://github.com/mohiva/play-silhouette/blob/master/silhouette/app/com/mohiva/play/silhouette/api/actions/SecuredAction.scala#L184) as we understand it – stubbi Sep 05 '17 at 10:24
  • @WillSargent We found that instead of implementing our own ActionBuilder, we can also add a filter with `Timeout.timeout(actorSystem, 500.milliseconds)(nextFilter(requestHeader))`. This should have the same effect, right? – stubbi Sep 05 '17 at 10:51
  • Yes, they are the same. Futures.timeout just wraps the actor system for you. – Will Sargent Sep 05 '17 at 12:44
  • Alright! So it seems that changing the slick settings as you suggested resolves the issue indeed :yay:! Still, we don't understand how it _exactly_ does it and if it will prevent us from running into this again in the future. – stubbi Sep 05 '17 at 14:22
  • @WillSargent unfortunately, I was too fast. The last deployment was just one of those that would bring the backend up randomly. We dumped again to prove that we could bring it up afterwards, but we couldn't. Even with the new slick settings the backend is not responding after the dump. – stubbi Sep 05 '17 at 15:12

After some more investigations we found that one of our jobs was generating deadlocks in our database. The issue we were running into is a known bug in the Slick version we were using and is reported on GitHub.

So the problem was that we were running DB transactions with .transactionally within a .map of a DBIOAction on too many threads at the same time.
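
To illustrate the shape of the problem, here is a simplified, hypothetical sketch (the table and names are made up, our real job code looks different, and we use the plain Slick Postgres driver just for the example). The outer action occupies a Slick pool thread while it waits on inner transactional actions that each need a pool thread of their own, so with a small pool everything ends up waiting on everything else.

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

import slick.driver.PostgresDriver.api._

// Hypothetical table, only here to make the example compile.
class Jobs(tag: Tag) extends Table[(Long, String)](tag, "jobs") {
  def id     = column[Long]("id", O.PrimaryKey)
  def status = column[String]("status")
  def *      = (id, status)
}

object DeadlockSketch extends App {
  val jobs = TableQuery[Jobs]
  val db   = Database.forConfig("slick.dbs.default.db")

  // The anti-pattern: inside the .map of one DBIOAction we synchronously run further
  // transactional actions. The outer action keeps its Slick pool thread busy while each
  // inner db.run needs another thread from the same small pool (numThreads = 4).
  // With several such jobs in flight, every pool thread ends up waiting and nothing completes.
  val outer: DBIO[Unit] = jobs.result.map { rows =>
    rows.foreach { case (id, _) =>
      val markChecked = jobs.filter(_.id === id).map(_.status).update("checked")
      Await.result(db.run(markChecked.transactionally), 10.seconds) // blocks a pool thread
    }
  }

  Await.result(db.run(outer), 1.minute)
}

Composing the inner updates into the outer action (for example with flatMap and DBIO.sequence) instead of blocking inside .map avoids tying up pool threads this way.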

stubbi
  • What do you mean by ".transactionally with a .map"? Do you mean flatMap? Can you illustrate with code? – nafg Apr 15 '18 at 21:28