
I want to support around 10,000 simultaneous HTTP clients on a small cluster of machines (as small as possible). I'd like to keep a connection to each client alive while the user is using the application, to allow the server to push updates.

I believe that async IO is often recommended for these kinds of long-lived connections, to avoid having lots of threads sitting idle. But what are the issues in having threads sitting idle? I find the threaded model mentally easier to work with, but I don't want to do something that is going to cause me major headaches. I guess I'll have to experiment, but I wondered if anyone knows of any previous experiments along these lines?

ᴇʟᴇvᴀтᴇ
  • The problem really is in the forking of the 10k threads themselves. This is going to use a ton of stack space to do so. If you have the memory then it may work for you but otherwise a NIO solution is better albeit more complicated. – Gray Mar 04 '14 at 21:07
  • http://stackoverflow.com/a/17771219/1305501 – nosid Mar 04 '14 at 21:08
  • Is that 10k different HTTP servers as well? If not you may want to consider using HTTP keepalive and reuse existing connections – fge Mar 04 '14 at 21:09
  • @Gray Do you have an estimate on the actual stack usage of a typical thread? Maximum stack size isn't relevant since the full size isn't committed to physical RAM. – Marko Topolnik Mar 04 '14 at 21:12
  • @CodeChimp A Linux configured properly as a server-class machine can support connections in the millions on a regular basis. – Marko Topolnik Mar 04 '14 at 21:13
  • Default depends on the OS and JVM. See here: http://www.onkarjoshi.com/blog/209/using-xss-to-adjust-java-default-thread-stack-size-to-save-memory-and-prevent-stackoverflowerror/ – Gray Mar 04 '14 at 21:15
  • I believe maximum space is all that matters @MarkoTopolnik -- at least in terms of running out. Sure a portion of it will not be resident. Depends a ton on the application I'd guess. – Gray Mar 04 '14 at 21:16
  • @Gray The -Xss switch sets the maximum size *per thread*, I don't know of a max total space for stacks. I know it depends a lot, but inspired by a recent question I did start wondering what exactly is a ballpark figure for a typical setup, say JBoss and Spring. My gut feeling says it's less than 64K peak usage, and an idle thread is taking probably the minimum it can---a single memory page (4K, 8K, something like that). – Marko Topolnik Mar 04 '14 at 21:19
  • Yeah there's no way to control the total space used by all threads without controlling the number of threads and their per-thread stack space @MarkoTopolnik. I bet a single thread is taking a lot more than 4k on occasion. Depends highly on the application but I've seen some _monster_ stack traces -- especially when in web handler chains. – Gray Mar 04 '14 at 21:23
  • @Gray Sure, we all know about those... let's say it reaches 200 stack frames, that's four screenfuls of stack trace. Each frame could be... 128 bytes on average (many methods are very small and just delegate to further methods), that's still less than 32K total stack size. – Marko Topolnik Mar 04 '14 at 21:33
  • My comments are for posterity @MarkoTopolnik. I know you know this stuff. The default is 64k or 128k or something which could happen depending on the call frames I guess. – Gray Mar 04 '14 at 21:36
  • @Gray No, I'm seriously wondering about this because I caught myself not taking all this into account. If the size turns out to be that trivial, then classical Java blocking I/O doesn't sound like a bad thing at all, contrary to my current belief. And blocking I/O is so much simpler to code against. – Marko Topolnik Mar 04 '14 at 21:38
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/49000/discussion-between-gray-and-marko-topolnik) – Gray Mar 04 '14 at 21:40
  • Here's a relevant link for OP: https://www.usenix.org/legacy/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html – Marko Topolnik Mar 04 '14 at 22:00
  • @Marko: Note that the article is more than 10 years old. – nosid Mar 04 '14 at 22:04
  • @nosid I sure did note that, and if you actually read it, you'll see that the arguments apply only *more* today than ten years ago. – Marko Topolnik Mar 05 '14 at 06:09
  • @Marko would you be able to summarize your thoughts into a fully-fledged answer? – ᴇʟᴇvᴀтᴇ Mar 06 '14 at 11:39

3 Answers


Asynchronous I/O basically means that your application does most of the thread scheduling. Instead of letting the OS randomly suspend your thread and schedule another one, you have only as many threads as there are CPU cores, and you yield to other tasks at the most appropriate point: when the thread reaches an I/O operation that will take some time.
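For concreteness, here is a minimal sketch of that model using Java NIO (the port and buffer size are arbitrary choices for illustration): one thread out of a core-sized pool runs a selector loop and only touches a connection when the OS reports it is ready.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class EventLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                        // block until some channel is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {  // peer closed the connection
                        key.cancel();
                        client.close();
                    }
                    // ...otherwise parse the request and register OP_WRITE when a push is due
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```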

The above seems like a clear win from the performance standpoint, but the asynchronous programming model is much more complex in several regards:

  1. it can't be expressed as a single function call, so the workflow is not obvious, especially when the transfer of control due to exceptions is considered;
  2. without specifically targeted support from the programming language, the idioms are very messy: spaghetti code and/or an extremely weak signal-to-noise ratio are the norm (see the sketch after this list);
  3. mostly due to point 1 above, debugging is much more difficult because the stack trace does not represent the progress of a unit of work as a whole;
  4. execution jumps from thread to thread within a pool (or even several pools, where each layer of abstraction has its own), so profiling and monitoring with the usual tools are rendered virtually useless.
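To make points 1–3 concrete, here is a small, hypothetical contrast (the helper methods are placeholders standing in for real I/O, not any particular API): the blocking version reads top to bottom and a stack trace shows exactly where it is, while the asynchronous version scatters the same unit of work across pool threads.

```java
import java.util.concurrent.CompletableFuture;

public class Contrast {
    // Hypothetical helpers standing in for real I/O calls.
    static String readRequest()           { return "GET /"; }
    static String queryBackend(String q)  { return "data for " + q; }
    static void   writeResponse(String s) { System.out.println(s); }

    // Blocking style: the whole unit of work is one call chain.
    static void handleBlocking() {
        String request = readRequest();
        String data = queryBackend(request);
        writeResponse(data);
    }

    // Asynchronous style: the same work is split across callbacks that may run
    // on different pool threads; an exception surfaces inside the pool, far
    // from the code that started the request.
    static CompletableFuture<Void> handleAsync() {
        return CompletableFuture.supplyAsync(Contrast::readRequest)
                .thenApplyAsync(Contrast::queryBackend)
                .thenAcceptAsync(Contrast::writeResponse);
    }

    public static void main(String[] args) {
        handleBlocking();
        handleAsync().join();
    }
}
```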

On the other hand, modern OSes have seen many favorable improvements and optimizations which mostly eliminate the performance downsides of synchronous I/O programming (a minimal blocking-I/O sketch follows the list):

  • the address space is huge, so space reserved for stacks isn't a problem;
  • the actual physical RAM load of call stacks is not very large as only the part of the stack actually used by a thread is committed to RAM, and a call stack doesn't normally exceed 64K;
  • context switching, which used to be prohibitively expensive for larger thread counts, has been improved to the point where its overhead is negligible for all practical purposes.
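A thread-per-connection sketch under those assumptions (the port and canned response are placeholders) shows how simple the blocking model stays:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                // One thread per connection; it simply blocks in read() while idle.
                new Thread(() -> handle(client)).start();
            }
        }
    }

    static void handle(Socket client) {
        try (Socket c = client) {
            // Read the request, push updates, etc., with plain blocking I/O.
            c.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".getBytes());
        } catch (IOException e) {
            // Connection dropped; nothing to clean up beyond the socket itself.
        }
    }
}
```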

A classical paper going through much of the above and some other points is a good complement to what I am saying here:

https://www.usenix.org/legacy/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html

Marko Topolnik

There are already some good pointers in the comments of your question.

The reason for not using 10K threads is that threads cost memory, and memory costs energy. The programming model is not an argument against async I/O, because the thread that sits on the client connection doesn't have to be the same thread that posts the event.

Please take a look at the WebSocket standard and the asynchronous request processing model in the Servlet 3.0 specification. All recent Java web application servers implement them now (e.g. GlassFish and Tomcat), and they are the solution to your problem.
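As a rough illustration of the Servlet 3.0 model (the class name, URL pattern and message source are made up for this sketch), the container thread is released as soon as the request is parked, and a different thread completes the response when an event arrives:

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/updates", asyncSupported = true)
public class PushServlet extends HttpServlet {
    // Connections waiting for a server push; no thread is parked on them.
    private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();
        ctx.setTimeout(0);          // keep the connection open until we complete it
        waiting.add(ctx);           // the container thread returns to the pool here
    }

    // Called by whatever part of the application produces the event.
    public void pushToAll(String message) throws IOException {
        AsyncContext ctx;
        while ((ctx = waiting.poll()) != null) {
            ctx.getResponse().getWriter().write(message);
            ctx.complete();         // a container thread finishes the exchange
        }
    }
}
```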

The question itself cannot be answered precisely, since the OS, JVM and application server you will use are not specified. However, you can test it quite quickly yourself by creating a servlet or JSP that does Thread.sleep(9999999) and running siege -c 10000 ... against it.
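For instance, a servlet along these lines (the class name and mapping are made up) parks one container thread per request:

```java
import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/sleep")
public class SleepServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            Thread.sleep(9999999);   // hold the container thread, simulating a long-lived connection
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        resp.getWriter().write("done");
    }
}
```

Deploy it, point siege -c 10000 at its URL, and watch memory use and latency while the threads are parked.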

cruftex

10,000 simultaneous HTTP clients...what are the issues in having threads sitting idle?

It seems that the cost of an idle thread is only the memory allocated for its kernel structures (a few KB) and its stack (512 KB to a few MB by default). But...

Obviously, you are going to wake up each of your thousands of threads from time to time, right? That is the moment you pay the cost of a context switch, which may not be so small (time spent in the system scheduler, extra cache misses, etc.). See, for instance: http://www.cs.rochester.edu/u/cli/research/switch.pdf

And you will have to pin your threads carefully so that they don't interfere with the system's own threads. As a result, a thread-per-connection (blocking I/O) architecture can increase the latency of the system compared to async I/O. But it can still work for your case if almost all threads are parked most of the time.

And the final word: we don't know how much time your threads are going to spend blocked on read(), how much work they need to do to process the received data, or what hardware, OS and network interfaces are going to be used. So, test a prototype of your system.
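As a very rough starting point (the thread count, stack-size flag and measurement approach are just examples, not a real benchmark), you can check what thousands of parked threads cost on your own JVM before building anything:

```java
import java.util.concurrent.CountDownLatch;

public class IdleThreadCost {
    public static void main(String[] args) throws InterruptedException {
        int threads = 10_000;
        CountDownLatch done = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                try {
                    done.await();            // park, like a thread blocked on read()
                } catch (InterruptedException ignored) {
                }
            });
            t.setDaemon(true);
            t.start();
        }
        Runtime rt = Runtime.getRuntime();
        System.out.printf("Heap used with %d idle threads: %d MB%n",
                threads, (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024));
        // Thread stacks live outside the heap, so also check the process RSS with
        // your OS tools (e.g. top), and re-run with -Xss256k to compare.
        done.countDown();
    }
}
```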

AnatolyG