Seeing long delays due to EventMachine quantum – how can I speed things up?

Question

I am developing a real-time application using EventMachine. Two clients, A and B, connect to an EventMachine server over standard TCP, or via WebSocket with em-websocket.

Every time data goes through EventMachine, code execution takes a 95ms hit. When A talks to the server, there is a 95ms delay. When A talks to B through the server, there is a 190ms delay.

If many requests occur in rapid succession, the delay disappears, except for the final request in the sequence. So, if I send 10 rapid requests, I'll get 9 responses after about 5ms each, but the 10th response will take 95ms again.

I've deduced that this has something to do with EventMachine.set_quantum. From the docs:

Method: EventMachine.set_quantum

For advanced users. This function sets the default timer granularity, which by default is slightly smaller than 100 milliseconds. Call this function to set a higher or lower granularity. The function affects the behavior of add_timer and add_periodic_timer. Most applications will not need to call this function.

Avoid setting the quantum to very low values because that may reduce performance under some extreme conditions. We recommend that you not use values lower than 10.

Well, that explains where the 95ms came from. Sure enough, the delays change when I call EventMachine.set_quantum, but I am wary of tweaking this value because of the warning in the documentation.

What is set_quantum actually doing? I can't find any documentation or explanation about what the quantum variable means.

What can I do to reduce these delays? I'd like to understand the potential repercussions of decreasing the quantum to, say, 10ms.

Is EventMachine even the right choice? I'm essentially using it as a glorified TCP connection. Maybe I should just stick to raw sockets for inter-process communication, and find a WebSocket server gem that doesn't use EventMachine.

Hi Schrockwell. Did you look over [the Plezi web-app framework](https://github.com/boazsegev/plezi)? It supports native websockets and doesn't use Rack or EventMachine (it runs a native Ruby server). Also, it's super easy to implement. If you try it out, can you let me know what you think of its performance? – Myst Jun 02 '15 at 01:39

score 2 · Answer 1 · answered May 31 '15 at 17:35

EventMachine constantly runs a loop in which it checks:

1. Whether any timers have fired.
2. Whether any of the file descriptors have work pending.

The second step uses the appropriate mechanism under the hood, e.g. the select() call. This is where the quantum value comes in: it is the upper bound on how long that call waits. So basically the loop looks rather like this:

1. Any timers triggered?
2. Any of the file descriptors have something to do with them? Wait for them, up to quantum millis.
3. Unless there's a shutdown request, go to the 1st step.
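The loop described above can be sketched in plain Ruby. This is only an illustration of the idea, not EventMachine's actual implementation: `toy_reactor`, `QUANTUM`, and the callback arguments are hypothetical names, and a pipe stands in for a socket.

```ruby
QUANTUM = 0.095 # seconds; hypothetical stand-in for EventMachine's quantum

# Runs a toy reactor loop over `ios` until `stop.call` returns true.
# `timers` is an array of [fire_at, callable] pairs (monotonic clock seconds).
def toy_reactor(ios, timers, on_readable, stop)
  until stop.call
    # Step 1: fire any timers that are due.
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    due, pending = timers.partition { |fire_at, _| fire_at <= now }
    timers.replace(pending)
    due.each { |_, callback| callback.call }

    # Step 2: wait for readable descriptors, for at most QUANTUM seconds.
    ready, = IO.select(ios, nil, nil, QUANTUM)
    (ready || []).each { |io| on_readable.call(io) }

    # Step 3: loop again unless a stop was requested.
  end
end

reader, writer = IO.pipe
events = []
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
timers = [[start + 0.05, -> { events << :timer }]] # asks to fire after 50 ms
writer.write("ping")                               # data is already waiting

toy_reactor(
  [reader],
  timers,
  ->(io) { events << io.read_nonblock(16) },
  -> { events.size >= 2 }
)

p events
```

In this sketch the pending data is handled on the first pass without waiting, while the 50 ms timer is only noticed once the idle wait in step 2 times out. That is one way to read the documentation's description of the quantum as a timer granularity.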

Therefore, setting the quantum to a lower value makes that loop iterate more often, which eats up CPU cycles. I don't think that would really be an issue, though.

What surprises me is that you have that communication delay at all, since all of those querying mechanisms (select, or epoll, or whatever) return immediately if there's an event (e.g. data) on the file descriptor. That basically means that you shouldn't be incurring those delays at all. And if that delay was by design, then numerous Thin users would've already been pretty upset about it.
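A quick way to check this claim about select on a given machine is to time Ruby's `IO.select` with and without pending data. This is a minimal sketch that uses a pipe instead of a TCP connection; the `elapsed` helper and the 95 ms timeout are assumptions chosen to mirror the question.

```ruby
# Returns how many seconds the given block took to run.
def elapsed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

reader, writer = IO.pipe
timeout = 0.095 # seconds, mirroring the default quantum from the question

# Nothing to read yet: select should block until the timeout expires.
idle_wait = elapsed { IO.select([reader], nil, nil, timeout) }

# Data is pending: select should return without waiting for the timeout.
writer.write("x")
ready_wait = elapsed { IO.select([reader], nil, nil, timeout) }

puts format("no data: %.1f ms, data pending: %.1f ms",
            idle_wait * 1000, ready_wait * 1000)
```

If the second figure is close to the timeout rather than close to zero, the delay would come from the polling call itself; otherwise the cause is more likely elsewhere in the application.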

All of this makes me think that there is something slightly not right in your code that makes it work that way. Unfortunately, I can't tell much more than that unless I see it.

Hope it helps!

answered May 31 '15 at 17:35

SkyWriter


Thank you for the explanation. It makes sense that the select/epoll should wake up immediately on an event, and it sounds like that is exactly what is not happening. I'm developing and testing on OS X – could it be an idiosyncrasy of OS X? – Schrockwell Jun 01 '15 at 18:51
According to this SO thread, OS X does not support epoll: http://stackoverflow.com/questions/13856413/does-os-x-not-support-epoll-function – Schrockwell Jun 01 '15 at 18:57
I did a bit more research. It appears that `select` requires polling by EM and will NOT return automatically, but `epoll` will automatically wake up EM when an event occurs, like you have described. OS X doesn't have `epoll`, so that must be why EM is falling back to `select` polling. OS X has an equivalent of `epoll` called `kqueue` which can be enabled by calling `EM.kqueue = EM.kqueue?`. http://www.paperplanes.de/2011/4/25/eventmachine-how-does-it-work.html – Schrockwell Jun 01 '15 at 19:04
I don't think so. `select` is a pretty standard call, and it does return immediately if any of the descriptors are ready. OS X `man select` clearly states that as well: "Select() returns the number of ready descriptors that are contained in the descriptor sets, or -1 if an error occurred. If the time limit expires, select() returns." It might make sense to assemble a vanilla TCP echo server on EM and check whether the problem is really in EM (e.g. using the code from here: https://github.com/eventmachine/eventmachine/wiki/Code-Snippets). – SkyWriter Jun 02 '15 at 03:15
Both `select` and `epoll` accept an array of IO objects. It seems to me that the `quantum` is there to deal both with timers and with situations where your connection (IO socket) is new and was added to the list of connections only **after** EM started waiting. If each new connection throws EM out of the waiting loop, the issue would not be noticed during a burst, but the last connection would still wait for the quantum timeout. – Myst Jun 06 '15 at 16:23
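The scenario in the last comment can be reproduced with plain `IO.select`. This sketch only illustrates the hypothesis and does not show how EventMachine actually registers new connections; the pipe names and the `quantum` variable are made up for the example.

```ruby
quantum = 0.095 # seconds

known_r, _known_w = IO.pipe # registered before the wait starts
late_r, late_w = IO.pipe    # "new connection", not yet in the watch list

watched = [known_r]
late_w.write("hello")       # data arrives on the descriptor nobody is watching

# First wait: select cannot see late_r, so it runs until the timeout.
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
first = IO.select(watched, nil, nil, quantum)
first_wait = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0

# The loop now picks up the new descriptor and waits again.
watched << late_r
t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
second = IO.select(watched, nil, nil, quantum)
second_wait = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t1

puts format("first wait: %.1f ms (timed out: %s)", first_wait * 1000, first.nil?)
puts format("second wait: %.1f ms", second_wait * 1000)
```

Here the pending data is only noticed after one full timeout, which matches the pattern in the question where the last request of a burst pays the quantum delay. Whether EventMachine behaves this way on a given platform would still need to be confirmed separately.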
