Project loom: what makes the performance better when using virtual threads?

Question

To give some context here, I have been following Project Loom for some time now. I have read The state of Loom. I have done asynchronous programming.

Asynchronous programming (provided by Java NIO) returns the thread to the thread pool when the task waits and it goes to great lengths to not block threads. And this gives a large performance gain, we can now handle many more request as they are not directly bound by the number of OS threads. But what we lose here, is the context. The same task is now NOT associated with just one thread. All the context is lost once we dissociate tasks from threads. Exception traces do not provide very useful information and debugging is difficult.

In comes Project Loom with virtual threads that become the single unit of concurrency. And now you can perform a single task on a single virtual thread.

It's all fine until now, but the article goes on to state, with Project Loom:

A simple, synchronous web server will be able to handle many more requests without requiring more hardware.

I don't understand how we get performance benefits with Project Loom over asynchronous APIs? The asynchrounous API:s make sure to not keep any thread idle. So, what does Project Loom do to make it more efficient and performant that asynchronous API:s?

EDIT

Let me re-phrase the question. Let's say we have an http server that takes in requests and does some crud operations with a backing persistent database. Say, this http server handles a lot of requests - 100K RPM. Two ways of implementing this:

The HTTP server has a dedicated pool of threads. When a request comes in, a thread carries the task up until it reaches the DB, wherein the task has to wait for the response from DB. At this point, the thread is returned to the thread pool and goes on to do the other tasks. When DB responds, it is again handled by some thread from the thread pool and it returns an HTTP response.
The HTTP server just spawns virtual threads for every request. If there is an IO, the virtual thread just waits for the task to complete. And then returns the HTTP Response. Basically, there is no pooling business going on for the virtual threads.

Given that the hardware and the throughput remain the same, would any one solution fare better than the other in terms of response times or handling more throughput?

My guess is that there would not be any difference w.r.t performance.

[The answer you got](https://stackoverflow.com/a/63371072/2711488), is short but nails it. Besides that, you have already linked to [a document that explains the concept in great detail](http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html#why). I suggest reading it, esp.the part where it explains how virtual threads are run *atop* of another executor, like a thread pool and how the synchronous calls get replaced by the asynchronous counterpart. Which makes the second approach transform to the first under the hood. So what additional information do you hope to gain from the bounty? — Holger, Aug 14 '20 at 12:50
FYI, presentations in late 2020 by Ron Pressler of Oracle on *Project Loom* technology: [here](https://youtu.be/23HjZBOIshY) and [here](https://youtu.be/zuc9JZz9Xbw). — Basil Bourque, Dec 17 '20 at 00:51

score 16 · Answer 1 · answered Aug 12 '20 at 06:34

16

We don't get benefit over asynchronous API. What we potentially will get is performance similar to asynchronous, but with synchronous code.

answered Aug 12 '20 at 06:34

talex

17,973
3
29
66

5

Precisely. It's often easier to write synchronous code because you don't have to keep writing code to put things down and pick them back up every time you can't make forward progress. Straightforward "do this, then do that, if this happens do this other thing" code is easier to write (and *much* easier to maintain) than a state machine updating explicit state. Virtual threads can give you most of the benefits of asynchronous code while your coding experience is much closer to that of writing synchronous code. – David Schwartz Aug 19 '20 at 00:05
4

@DavidSchwartz that’s even more true when it comes to analyzing exceptions or debugging the code. Further, cancelling an ongoing operation is much easier with a virtual thread. And thread local vars will work as intended. But that’s all already explained in the document linked the question… – Holger Aug 19 '20 at 08:21

score 9 · Accepted Answer · answered Dec 09 '20 at 12:47

The answer by @talex puts it crisply. Adding further to it.

Loom is more about a native concurrency abstraction, which additionally helps one write asynchronous code. Given its a VM level abstraction, rather than just code level (like what we have been doing till now with CompletableFuture etc), It lets one implement asynchronous behavior but with reduce boiler plate.

With Loom, a more powerful abstraction is the savior. We have seen this repeatedly on how abstraction with syntactic sugar, makes one effectively write programs. Whether it was FunctionalInterfaces in JDK8, for-comprehensions in Scala.

With loom, there isn't a need to chain multiple CompletableFuture's (to save on resources). But one can write the code synchronously. And with each blocking operation encountered (ReentrantLock, i/o, JDBC calls), the virtual-thread gets parked. And because these are light-weight threads, the context switch is way-cheaper, distinguishing itself from kernel-threads.

When blocked, the actual carrier-thread (that was running the run-body of the virtual thread), gets engaged for executing some other virtual-thread's run. So effectively, the carrier-thread is not sitting idle but executing some other work. And comes back to continue the execution of the original virtual-thread whenever unparked. Just like how a thread-pool would work. But here, you have a single carrier-thread in a way executing the body of multiple virtual-threads, switching from one to another when blocked.

We get the same behavior (and hence performance) as manually written asynchronous code, but instead avoiding the boiler-plate to do the same thing.

Consider the case of a web-framework, where there is a separate thread-pool to handle i/o and the other for execution of http requests. For simple HTTP requests, one might serve the request from the http-pool thread itself. But if there are any blocking (or) high CPU operations, we let this activity happen on a separate thread asynchronously.

This thread would collect the information from an incoming request, spawn a CompletableFuture, and chain it with a pipeline (read from database as one stage, followed by computation from it, followed by another stage to write back to database case, web service calls etc). Each one is a stage, and the resultant CompletablFuture is returned back to the web-framework.

When the resultant future is complete, the web-framework uses the results to be relayed back to the client. This is how Play-Framework and others, have been dealing with it. Providing an isolation between the http thread handling pool, and the execution of each request. But if we dive deeper in this, why is it that we do this?

One core reason is to use the resources effectively. Particularly blocking calls. And hence we chain with thenApply etc so that no thread is blocked on any activity, and we do more with less number of threads.

This works great, but quite verbose. And debugging is indeed painful, and if one of the intermediary stages results with an exception, the control-flow goes hay-wire, resulting in further code to handle it.

With Loom, we write synchronous code, and let someone else decide what to do when blocked. Rather than sleep and do nothing.

`This works great, but quite verbose. And debugging is indeed painful, and if one of the intermediary stages results with an exception, the control-flow goes hay-wire, resulting in further code to handle it.` Why do you say that it's verbose? Rather which part is verbose here? The future part? — Ashwin, Dec 14 '20 at 02:54
Yes. Having said it, I think future's are unavoidable for several other scenarios.... Especially let's say when one wants to do parallel activities (like make multiple web service calls and merge the results). But each of those activity could be synchonous.. — Jatin, Dec 14 '20 at 03:46
@Jatin, this part about completable future is misleading: "One core reason is to use the resources effectively. Particularly blocking calls. And hence we chain with thenApply etc so that no thread is blocked on any activity, and we do more with less number of threads." There isn't any magic CompletableFuture could do with system threads. We won't get by with lesser number of threads keeping throughput scalable - completable task supplier in supplyAsync will block the thread it's called in (borrowed from executor provided or default ForkJoinPool). — RoK, Aug 09 '22 at 05:23
@Jatin, but with virtual threads we indeed can benefit in lower number of threads provided for CF suppliers through newVirtualThreadPerTaskExecutor, even with just one thread using property jdk.virtualThreadScheduler.maxPoolSize = 1 — RoK, Aug 09 '22 at 07:24

score 4 · Answer 3 · answered Aug 18 '20 at 12:03

The http server has a dedicated pool of threads .... How big of a pool? (Number of CPUs)*N + C? N>1 one can fall back to anti-scaling, as lock contention extends latency; where as N=1 can under-utilize available bandwidth. There is a good analysis here.
The http server just spawns... That would be a very naive implementation of this concept. A more realistic one would strive for collecting from a dynamic pool which kept one real thread for every blocked system call + one for every real CPU. At least that is what the folks behind Go came up with.

The crux is to keep the {handlers, callbacks, completions, virtual threads, goroutines : all PEAs in a pod} from fighting over internal resources; thus they do not lean on system based blocking mechanisms until absolutely necessary This falls under the banner of lock avoidance, and might be accomplished with various queuing strategies (see libdispatch), etc.. Note that this leaves the PEA divorced from the underlying system thread, because they are internally multiplexed between them. This is your concern about divorcing the concepts. In practice, you pass around your favourite languages abstraction of a context pointer.

As 1 indicates, there are tangible results that can be directly linked to this approach; and a few intangibles. Locking is easy -- you just make one big lock around your transactions and you are good to go. That doesn't scale; but fine-grained locking is hard. Hard to get working, hard to choose the fineness of the grain. When to use { locks, CVs, semaphores, barriers, ... } are obvious in textbook examples; a little less so in deeply nested logic. Lock avoidance makes that, for the most part, go away, and be limited to contended leaf components like malloc().

I maintain some skepticism, as the research typically shows a poorly scaled system, which is transformed into a lock avoidance model, then shown to be better. I have yet to see one which unleashes some experienced developers to analyze the synchronization behavior of the system, transform it for scalability, then measure the result. But, even if that were a win experienced developers are a rare(ish) and expensive commodity; the heart of scalability is really financial.

Project loom: what makes the performance better when using virtual threads?

EDIT

3 Answers3

Linked