
We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.

This is an example of a request that normally takes ~150ms (and 19 times out of 20, that is how long it takes). You can see that on this request it took about 4 seconds, i.e. between one and two orders of magnitude longer.

[screenshot: trace of the slow (~4 second) request described above]

Some things to note:

  • the database was not the bottleneck; the request spent only 25ms on DB queries
  • we have more than sufficient dynos, so I don't think they were the bottleneck (20 double dynos running unicorn with 5 workers each; we get only ~1000 requests per minute at an average response time of 150ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute). In other words, we had 40x the needed capacity when this measurement was taken (the arithmetic is spelled out just below).
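
For the record, here is that capacity arithmetic spelled out as a quick sanity check (the dyno count, worker count, and response time are simply the numbers quoted above):

```ruby
# Rough theoretical capacity check, using the numbers from the setup above.
dynos            = 20      # double dynos
workers_per_dyno = 5       # unicorn workers per dyno
avg_response_s   = 0.150   # average response time in seconds

requests_per_worker_per_min = 60 / avg_response_s                      # => 400.0
capacity_per_min = requests_per_worker_per_min * dynos * workers_per_dyno
puts capacity_per_min                                                   # => 40000.0

actual_load_per_min = 1_000
puts capacity_per_min / actual_load_per_min                             # => 40.0 (~40x headroom)
```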

So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only things I can think of are a noisy neighbor problem on the boxes or inconsistent performance in the routing layer. If anyone has additional info or ideas, I would be curious. Thank you.

Brian Armstrong
  • If it's any consolation, I'm experiencing this too and haven't been able to track down an application specific reason for it. Another data point for the noisy neighbour theory? – Craig Timpany May 15 '13 at 07:16
  • Have you tried a support ticket with Heroku? – Neil Middleton Sep 27 '13 at 14:02
  • Revisiting this after a long while, here's a *crazy* theory: it's something to do with Heroku's "fair CPU share" and the virtualization they use to accomplish this. To prove my point, you could write a small piece of code, say a handler for the `/test` url. In it, count to several million. Just burn CPU. If *that* ends up taking sometimes 150ms and sometimes 4sec, then it's Heroku. (A rough sketch of such a handler is below these comments.) – Nitzan Shaked Oct 02 '13 at 19:38
  • I know I am a little late to the party but I have seen similar issues recently running Puma on Heroku. Even for actions returning a simple 304 Not Modified, the spikes can reach 1000ms. New Relic reports no latency spikes in the routing, or between the app and either of the databases (SQL or NoSQL). GC also doesn't seem to spike. It's just a giant spike in the controller action... Anyone else figure this out, or is this just a Heroku issue? – rovermicrover Jul 19 '14 at 00:01
  • @rovermicrover I think I have the same issue with Puma on Heroku: occasionally I get a request that takes 500ms+ (all the server does is write to Redis, so it's usually 20ms). Any advice? – Sam Dec 22 '14 at 21:22
  • @Sam Nope not really. The only response I got out of heroku was that they restart each dyno at least once a day. They suggested what I was seeing was the application booting up on the dyno. – rovermicrover Dec 29 '14 at 22:15
  • @rovermicrover I think my delays are due to Major GC runs. Gonna try 2.2.0 and see if the incremental GC helps. – Sam Dec 29 '14 at 22:18
  • @Sam I have new relic set up to show GC runs on my end and that is not the issue for me at least. – rovermicrover Dec 29 '14 at 22:19
  • An update on my end: running Ruby 2.2.2 now, still got the same issues. Saw one delay of about 800ms the other day on a status-check method that does three DB reads and one Redis read, then outputs a JSON hash with three key/value pairs. The delay shows up sporadically in running code, not in the outside requests to the databases. It seems like all threads just hang, even if no other requests are happening on the instance. – rovermicrover May 04 '15 at 21:25
  • 1
    According to this answer on a similar question heroku is admitting this can happen because a heroku issue with how memory is handled between shared dyno types, ie 1x and 2x. http://stackoverflow.com/a/32465651/1436131. Our issues seem to fit the same patter from what I looked at today with our logs. – rovermicrover Sep 13 '15 at 05:17
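
To make the CPU-burn experiment from the comment above concrete, here is a minimal sketch. It assumes a Rails-style controller; the route, class name, and iteration count are arbitrary placeholders, not anything from the original question:

```ruby
# Hypothetical controller for a /test route: burn a fixed amount of CPU and
# report wall-clock time. If the same fixed work swings from ~150ms to several
# seconds, the variance is coming from the dyno's CPU share, not your app code.
class TestController < ApplicationController
  def index
    started = Time.now
    x = 0
    5_000_000.times { x += 1 }   # pure CPU work, no I/O
    render plain: "burned #{x} iterations in #{((Time.now - started) * 1000).round}ms"
  end
end
```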

1 Answer


I have been chasing a similar problem myself, with not much luck so far.

I suppose the first order of business would be to recommend NewRelic. It may have some more info for you on these cases.

Second, I suggest you look at queue times: how long your request was queued before your app saw it. Look at NewRelic for this, or do it yourself with the request-start HTTP header that Heroku adds to your incoming request (just print now() minus the header value as your queue time).
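
If you want to measure that by hand, a minimal Rack middleware sketch could look like the following. It assumes Heroku's `X-Request-Start` header (seen as `HTTP_X_REQUEST_START` in the Rack env) holds an epoch timestamp in milliseconds; some setups prefix it with `t=` or use other units, so adjust the parsing if yours differs:

```ruby
# Minimal Rack middleware sketch: report how long a request waited between
# Heroku's router and the app. Assumes X-Request-Start is epoch milliseconds.
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (raw = env["HTTP_X_REQUEST_START"])
      started_ms = raw.sub(/\At=/, "").to_f
      queue_ms   = Time.now.to_f * 1000 - started_ms
      $stdout.puts "queue_time=#{queue_ms.round}ms path=#{env['PATH_INFO']}"
    end
    @app.call(env)
  end
end

# e.g. in config.ru:
#   use QueueTimeLogger
```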

When those failed me in my case, I tried coming up with things that could go wrong, and here's an (unorthodox? weird?) list:

1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
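
A quick way to check the DNS angle is to time the lookups your app actually depends on. A rough sketch; the host names below are placeholders for whatever your DB, Redis, and external services resolve:

```ruby
require "resolv"

# Time DNS resolution for the hosts your app talks to (placeholder names).
hosts = %w[
  your-db-host.example.com
  your-redis-host.example.com
  api.external-service.example.com
]

hosts.each do |host|
  started = Time.now
  begin
    Resolv.getaddress(host)
    puts "#{host}: #{((Time.now - started) * 1000).round}ms"
  rescue Resolv::ResolvError => e
    puts "#{host}: lookup failed (#{e.message})"
  end
end
```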

2) Log performance -- Heroku collects all your stdout output using their "Logplex", which it then drains to your own configured log drains (services such as Papertrail, etc.). There is no documentation on the performance of this, and writes to stdout from your process could, theoretically, block for periods while Heroku flushes any buffers it might have there.
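
One crude way to test that theory is to time the writes themselves. A sketch only; the 50ms threshold and the helper name are arbitrary:

```ruby
# Crude check: does writing to stdout ever block noticeably?
def timed_puts(message, threshold_ms: 50)
  started = Time.now
  $stdout.puts(message)
  elapsed_ms = (Time.now - started) * 1000
  warn "slow stdout write: #{elapsed_ms.round}ms" if elapsed_ms > threshold_ms
end

1_000.times { |i| timed_puts("log line #{i}") }
```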

3) Getting a DB connection -- I'm not sure which framework you are using, but maybe you are getting DB connections from a connection pool, and the checkout took time? It won't show up as query time; it'll be blocking time for your process.
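
If you happen to be on ActiveRecord (an assumption on my part), you can time the pool checkout separately from any query. A rough sketch:

```ruby
# Rough sketch (assumes ActiveRecord): measure how long it takes just to
# check a connection out of the pool, independent of query time.
started = Time.now
ActiveRecord::Base.connection_pool.with_connection do |conn|
  checkout_ms = (Time.now - started) * 1000
  $stdout.puts "pool checkout took #{checkout_ms.round}ms"
  # conn.execute("SELECT 1")  # the query itself would be timed separately
end
```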

4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
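
If it helps, the feature I mean is (as far as I remember) the Heroku Labs `log-runtime-metrics` flag, and the lines it emits are key=value pairs, so graphing them is mostly a matter of parsing your log drain. A rough sketch, assuming the lines contain entries like `sample#load_avg_1m=2.15` (format assumed, not verified):

```ruby
# Rough sketch: pull load-average / memory samples out of runtime-metrics
# log lines and print them in Graphite's plaintext format.
METRIC = /sample#([\w.]+)=([\d.]+)/

ARGF.each_line do |line|
  line.scan(METRIC) do |name, value|
    puts "heroku.dyno.#{name} #{value.to_f} #{Time.now.to_i}"
  end
end
```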

Do let us know what you come up with.

Nitzan Shaked
  • 2
    We're using NewRelic. We occasionally still see spiky request queueing times http://cl.ly/image/112t2x1F3I1s I am not sure why since we should be way over capacity on dynos (20 dynos, 5 workers per dyno, double dynos, only ~1000 requests per minute). Your list of suspects looks good. I feel like the problem is still in the routing layer but this is just a guess. – Brian Armstrong May 13 '13 at 21:59
  • @Brian - Out of curiosity, why do you have 20GB of app servers for 1000 req/min? That's only 16-17 req/sec. Do you get huge spikes of 10x traffic at times? (Edit: never mind, I just realized you run coinbase... awesome btw, but easy to see where the scale comes in!) Also, did you ever figure out the reason for slow requests? I have the same problem (about 1 of every 3 requests to a basic view (no DB access) is super slow) and it drives me nuts. – orokusaki Oct 19 '13 at 15:26