Puma is actually both multithreaded and multiprocess. You can invoke it in "clustered mode", where it forks multiple workers that can run on different cores under MRI. Since Puma is multithreaded, it's probably appropriate to run a number of processes equal to the number of cores on the server. So for a 4-core server, something like this would be appropriate:
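puma -t 2:8 -w 4 --preload
The -t limits apply to each worker, so this runs 4 workers with up to 8 threads each and handles up to 32 concurrent threads in total. On MRI only one thread per process executes Ruby code at a time, so up to 4 threads run on the CPUs concurrently, which should be able to maximize the CPU resources on the server. The --preload argument loads the app before forking and takes advantage of the copy-on-write (COW) friendly garbage collector introduced in Ruby 2.0 to reduce RAM usage.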
If your app spends considerable time waiting on other services (search services, databases, etc.), then this will be a large improvement. When a thread blocks, another thread in the same process can grab the CPU and do work. You can support up to 32 requests in parallel in this example, while only taking the hit of running 4 processes in RAM.
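Here is a rough sketch of why blocked threads help. It is not Puma code: fake_request and timed are hypothetical helpers, and sleep stands in for a blocking call to a database or search service. It handles 8 simulated requests first one at a time, then with one thread per request inside a single process.

```ruby
# sleep stands in for a blocking database or HTTP call (an assumption
# made for illustration). While a thread sleeps, other threads can run.
def fake_request(wait_seconds)
  sleep(wait_seconds)
  :done
end

# Runs the block and returns [block result, elapsed seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

requests = 8
wait = 0.2

# One after another, the way a single-threaded worker handles them.
serial_results, serial_time = timed do
  Array.new(requests) { fake_request(wait) }
end

# All at once, one thread per request in the same process.
threaded_results, threaded_time = timed do
  Array.new(requests) { Thread.new { fake_request(wait) } }.map(&:value)
end

puts format("serial:   %.2fs", serial_time)
puts format("threaded: %.2fs", threaded_time)
```

The threaded run should finish in roughly the time of a single wait, because the waits overlap. If you replace the sleep with pure CPU work, you should not expect the same gain on MRI.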
With Unicorn, you would have to fork off 32 workers which would take the hit of running 32 processes in RAM, which is highly wasteful.
If all your app does is CPU crunching, then running 32 Unicorn workers will be highly inefficient; you should reduce the number of Unicorn workers, and the benefits of Puma over Unicorn would be reduced. But in the Unicorn case, you have to benchmark your app and figure out the right number. Puma will tend to optimize itself by spawning more threads as load requires, and its performance should range from no worse than Unicorn (in the pure CPU case) to vastly better than Unicorn (in the case of an app that sleeps a lot).
Of course, if you use Rubinius or JRuby, then it's no contest: you can spawn one process that runs multicore and handles all 32 threads.
TL;DR is that I don't think there's much advantage to Unicorn over Puma since Puma actually uses both models.
Of course, I don't know anything about the reliability of Puma vs Unicorn in running production software in the real world. One thing to be concerned about is that if you scribble over any global state in one thread, it can affect other requests executing at the same time, which may produce indeterminate results. Since Unicorn doesn't use threads, it has no such concurrency issues. I would hope that by this time both Puma and Rails are mature with respect to concurrency issues and that Puma is usable in production. However, I would not necessarily expect every Rails plugin and gem that I found on GitHub to be thread-safe, and would expect to have to do some additional work. But once you're successful enough to be finding threading problems in third-party libraries, you're probably large enough that you can't afford the RAM cost of running so many Unicorn processes. OTOH, I understand concurrency bugs and I'm good with Ruby, so that debugging cost may be much less for me than the cost of buying RAM in the cloud. YMMV.
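To make the global state concern concrete, here is a minimal sketch of the usual fix: guard shared mutable state with a Mutex. HitCounter is a hypothetical class invented for this example, not something from Puma or Rails.

```ruby
# Hypothetical shared state: a hit counter that every request thread updates.
class HitCounter
  def initialize
    @count = 0
    @lock = Mutex.new
  end

  # The read-modify-write is wrapped in a mutex so two threads cannot
  # interleave between reading @count and writing it back.
  def increment
    @lock.synchronize { @count += 1 }
  end

  def count
    @lock.synchronize { @count }
  end
end

counter = HitCounter.new

# 8 threads stand in for 8 requests being served at the same time.
threads = Array.new(8) do
  Thread.new { 1_000.times { counter.increment } }
end
threads.each(&:join)

puts counter.count # prints 8000
```

Without the lock, the final count is not guaranteed, and that is the kind of bug you may have to chase in a library that was only ever run under a single-threaded server.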
Also note that I'm not sure whether you should count hyperthreaded cores or physical cores when estimating the value to pass to -w; you'd need to perf test that yourself, along with perf testing what values to use for -t. Even if you run twice the number of processes that you "need" to, the process scheduler in the kernel should be able to handle that without trouble until you saturate the CPU, in which case you'll have larger issues anyway. I would probably recommend starting one process for each hyperthreaded core (on MRI).
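If you want a starting value for -w without hard-coding it, something like the following sketch could work. Etc.nprocessors is in Ruby's standard library and reports logical CPUs, so hyperthreaded cores are counted. The default_worker_count helper and the WEB_CONCURRENCY override are assumptions made for this example, and the result is only a starting point for perf testing.

```ruby
require "etc"

# Returns a starting guess for the number of workers.
# An explicit WEB_CONCURRENCY value (an assumed, conventional variable name)
# wins; otherwise fall back to one worker per logical CPU.
def default_worker_count(env = ENV)
  override = env["WEB_CONCURRENCY"]
  return Integer(override) if override && !override.empty?

  Etc.nprocessors
end

puts "suggested -w value: #{default_worker_count}"
```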