47

There are many solutions geared toward implementing "user-space" threads. Be it golang.org's goroutines, Python's green threads, C#'s async, Erlang's processes, etc. The idea is to allow concurrent programming even with a single or limited number of threads.

What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.

The OS provides both of these functions for free. Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?

Chi-Lan
  • 3,575
  • 3
  • 22
  • 24
  • They're not just considered expensive, they are. I believe some green threads (Haskell's?) weigh just a couple kilobytes each, i.e. are a hundred times smaller. Another issue: standard Python threads are not green - they have some problems with multithreading due to the GIL, but they are nevertheless real OS threads (maybe you're thinking of `greenlets`? Those are a different story, and indeed similar to green threads). –  Apr 01 '12 at 14:00
  • 1
    @delnan OK, I hear that. But I'm still not sure why they should be more expensive. Both need to save the stack and do a context switch (ignore the GIL, there are many non-Python examples). – Chi-Lan Apr 01 '12 at 14:02

7 Answers

20

I want to amend Tudor's answer, which is a good starting point. There are two main overheads of threads:

  1. Starting and stopping them. This involves creating a stack and kernel objects, and requires kernel transitions and global kernel locks.
  2. Keeping their stack around.

(1) is only a problem if you are creating and stopping them all the time. This is commonly solved using thread pools. I consider this problem to be practically solved. Scheduling a task on a thread pool usually does not involve a trip to the kernel, which makes it very fast. The overhead is on the order of a few interlocked memory operations and a few allocations.
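
To make (1) concrete, here is a minimal sketch in Python (illustrative only; any pool implementation works similarly): the pool's worker threads are created once, and every submitted task reuses them, so no per-task thread creation cost is paid.

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Runs on a pre-started worker thread; no per-task thread
    # creation, stack allocation, or kernel object setup here.
    return n * n

# The four OS threads are created once and reused across all
# ten submissions; scheduling a task is just an enqueue.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(10)))

print(results)  # squares of 0..9
```

Submitting a new task here is roughly a queue push plus a wake-up, which is the "few interlocked memory operations" mentioned above.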

(2) This becomes important only if you have many threads (> 100 or so). In this case async IO is a means to get rid of the threads. I found that if you don't have insane numbers of threads, synchronous IO, including blocking calls, is slightly faster than async IO (you read that right: sync IO is faster).

usr
  • 168,620
  • 35
  • 240
  • 369
  • 2
    (1) I'm not sure why kernel objects are more expensive than userspace objects; you need locks anyhow, and all locks boil down to OS/kernel locks. I don't understand (2); you need to keep their stack anyhow. – Chi-Lan Apr 01 '12 at 14:36
  • Not all thread alternatives keep the stack around, for example in the case where a future/task has not started executing yet. Also, OS thread stacks can be heavier: .NET stacks always commit 1 MB of memory (which is unfortunate). – usr Apr 01 '12 at 14:39
  • 2
    As for (1): Locks don't boil down to kernel locks. Many optimizations are possible for uncontended and/or shortly-held locks. Kernel objects have more overhead for many reasons (for example, they can be shared across processes, can have ACLs, ...). They also require a kernel mode transition. – usr Apr 01 '12 at 14:41
  • 2
    OK, got (1) covered, thanks, and I guess the answer for (2) is "some thread systems are implemented in a defective manner, so they need a new model to amend the stack problem they caused ;-)". – Chi-Lan Apr 01 '12 at 14:41
  • This is a valid rephrasing ;-) – usr Apr 01 '12 at 14:43
  • 1
    A great answer, you've nailed the major threading concerns with regard to concurrency. – Matt Joiner Jun 18 '12 at 07:01
  • To illustrate (1), I'm using Python threads for I/O, and migrating from short-lived threads to the standard library ThreadPoolExecutor gave me a >20% speed boost. – Liz Av Aug 15 '18 at 22:07
  • Why can't kernel threads be designed to act like userspace coroutines? Could you explain in more detail what kind of flexibility prevents a kernel thread from being optimized that way (via kernel changes)? – SOFe Aug 10 '19 at 12:53
  • @SOFe the kernel certainly could use coroutines (more). In essence, every callback is a manually written coroutine. I am certainly not claiming the kernel can't do that. Coroutines can be very CPU efficient because there is no switching and allocation overhead. Both problems I spoke about in the answer are not a practical issue in the kernel. – usr Aug 10 '19 at 15:25
14

Saving the stack is trivial, no matter its size - the stack pointer needs to be saved in the Thread Info Block in the kernel (usually along with most of the registers, since they will have been pushed by whatever soft/hard interrupt caused the OS to be entered).

One issue is that a protection level ring-cycle is required to enter the kernel from user space. This is an essential, but annoying, overhead. Then the driver or system call has to do whatever was requested by the interrupt, and then comes the scheduling/dispatching of threads onto processors. If this results in the preemption of a thread from one process by a thread from another, a load of extra process context has to be swapped as well. Even more overhead is added if the OS decides that a thread running on a different processor core than the one handling the interrupt must be preempted - the other core must be hardware-interrupted (this is on top of the hard/soft interrupt that entered the OS in the first place).

So, a scheduling run may be quite a complex operation.

'Green threads' or 'fibers' are, (usually), scheduled from user code. A context-change is much easier and cheaper than an OS interrupt etc. because no Wagnerian ring-cycle is required on every context-change, process-context does not change and the OS thread running the green thread group does not change.

Since something-for-nothing does not exist, there are problems with green threads. They are run by 'real' OS threads. This means that if one 'green' thread in a group run by one OS thread makes an OS call that blocks, all green threads in the group are blocked. This means that simple calls like sleep() have to be 'emulated' by a state machine that yields to other green threads (yes, just like re-implementing the OS). The same goes for any inter-thread signalling.

Also, of course, green threads cannot directly respond to IO signaling, somewhat defeating the point of having threads in the first place.
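
To illustrate the 'emulated sleep()' point above, here is a toy cooperative scheduler in Python (purely illustrative, not how any particular green-thread library is implemented): each 'green thread' is a generator that yields its wake-up time instead of calling the blocking sleep(), and a single OS thread multiplexes them all. A real blocking call anywhere in a task would stall every task in the group.

```python
import heapq
import time

def sleep_until(deadline):
    # Cooperative "sleep": yield the wake-up time to the scheduler
    # instead of making the blocking OS sleep() call.
    yield deadline

def worker(name, delay, log):
    log.append((name, "start"))
    yield from sleep_until(time.monotonic() + delay)
    log.append((name, "done"))

def run(tasks):
    # One OS thread multiplexes all "green threads": run each task
    # until it yields, then park it on a wake-up heap.
    ready = list(tasks)
    sleeping = []   # (wake_time, tie_breaker, task)
    tick = 0
    while ready or sleeping:
        while ready:
            task = ready.pop(0)
            try:
                wake = next(task)          # run until the next yield
            except StopIteration:
                continue                   # task finished
            tick += 1
            heapq.heappush(sleeping, (wake, tick, task))
        if sleeping:
            wake, _, task = heapq.heappop(sleeping)
            time.sleep(max(0, wake - time.monotonic()))
            ready.append(task)

log = []
run([worker("a", 0.02, log), worker("b", 0.01, log)])
print(log)  # "b" finishes before "a" despite starting second
```

This is essentially the state machine described above: sleep() is re-implemented in user space, which is exactly the "re-implementing the OS" flavor of the trade-off.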

Martin James
  • 24,453
  • 3
  • 36
  • 60
9

There are many solutions geared toward implementing "user-space" threads. Be it golang.org's goroutines, Python's green threads, C#'s async, Erlang's processes, etc. The idea is to allow concurrent programming even with a single or limited number of threads.

It's an abstraction layer. It's easier for many people to grasp this concept and use it effectively in many scenarios. It's also easier for many machines (assuming a good abstraction), since in many cases the model moves from spreading work across threads ("width") to pulling tasks from a queue. With pthreads (as an example), you have all the control. With other threading models, the idea is to reuse threads so that creating a concurrent task is inexpensive, which amounts to a completely different threading model. This model is far easier to digest; there's less to learn and measure, and the results are generally good.

What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.

Creating a thread is expensive, and the stack requires memory. As well, if your process is using many threads, then context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread became a good solution for medium to large tasks, ideally in low numbers. That's restrictive, and quite time consuming to maintain.

A task/thread pool/userland thread does not need to worry about much of the context switching or thread creation. It's often "reuse the resource when it becomes available, if it's not ready now -- also, determine the number of active threads for this machine".

More commonly (IMO), OS-level threads are expensive because they are not used correctly by engineers: either there are too many of them and there is a ton of context switching, there is competition for the same set of resources, or the tasks are too small. It takes much more time to understand how to use OS threads correctly, and how best to apply that to the context of a program's execution.

The OS provides both of these functions for free.

They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time 'soon' -- all the process' time is divided among the threads. That's not the common case with user threads. The task is often enqueued when the resource is not available. This reduces context switching, memory, and the total number of threads which must be created. When the task exits, the thread is given another.

Consider this analogy of time distribution:

  • Assume you are at a casino. There are a number people who want cards.
  • You have a fixed number of dealers. There are fewer dealers than people who want cards.
  • There are not always enough cards for every person at any given time.
  • People need all cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.

How would you ask the dealers to distribute cards?

Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.

The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.

How would you deal fastest if there were 2 dealers and 3 people? and if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver 'on demand'. Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (increases context switching). There are a number of strategies to deal cards more quickly, especially after you eliminate the people's need for cards in a certain amount of time. Would it not be faster to go to a table and deal to a person or people until their game is complete if the dealer to people ratio were 1/50? Compare this to visiting every table based on priority, and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.

Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?

If you developed a performance critical low level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available for users). From that angle, the importance of higher level multithreading models is a simple and obvious solution/optimization based on real world usage as well as the ideal that the entry bar for adopting and effectively utilizing multithreading can be lowered.

It's not that they are expensive -- the lightweight threads' model and pool is a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these consideration can dramatically reflow a program's execution/implementation. With higher level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs pull).

justin
  • 104,054
  • 14
  • 179
  • 226
5

The problem with starting kernel threads for each small task is that it incurs a non-negligible overhead to start and stop, coupled with the stack size it needs.

This is the first important point: thread pools exist so that you can recycle threads, in order to avoid wasting time starting them as well as wasting memory for their stacks.

Secondly, if you fire off threads to do asynchronous I/O, they will spend most of their time blocked waiting for the I/O to complete, thus effectively not doing any work and wasting memory. A much better option is to have a single worker handle multiple async calls (through some under-the-hood scheduling technique, such as multiplexing), thus again saving memory and time.
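
As a sketch of the single-worker-multiplexing idea, using Python's asyncio (the delays here are hypothetical stand-ins for I/O waits): three overlapping "requests" run on one event-loop thread, and the total wall time is roughly the longest delay rather than the sum.

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an async I/O call; while this task awaits,
    # the single event-loop thread runs the other tasks.
    await asyncio.sleep(delay)
    return name

async def main():
    # One worker (the event loop) multiplexes three calls; no
    # thread sits blocked waiting for each one individually.
    return await asyncio.gather(
        fetch("a", 0.03), fetch("b", 0.01), fetch("c", 0.02)
    )

results = asyncio.run(main())
print(results)  # ['a', 'b', 'c']
```

With blocking threads, the same work would tie up three stacks that spend nearly all their time waiting.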

One thing that makes "green" threads faster than kernel threads is that they are user-space objects, managed by a virtual machine. Starting them is a user space call, while starting a thread is a kernel-space call that is much slower.
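
A rough illustration of that cost difference (an informal micro-benchmark sketch, not a rigorous measurement): starting an OS thread is a kernel-space call that allocates a stack and kernel object, while creating a generator-based "green thread" is plain user-space object allocation.

```python
import threading
import time

def nop():
    pass

N = 200

# Creating and joining N OS threads: each start() is a kernel
# call that sets up a stack and kernel object.
t0 = time.perf_counter()
for _ in range(N):
    t = threading.Thread(target=nop)
    t.start()
    t.join()
os_thread_cost = time.perf_counter() - t0

def green():
    # A trivial generator standing in for a "green thread":
    # creating one never enters the kernel.
    yield

t0 = time.perf_counter()
tasks = [green() for _ in range(N)]
for task in tasks:
    for _ in task:   # drive each to completion
        pass
green_cost = time.perf_counter() - t0

print(f"OS threads: {os_thread_cost:.4f}s, green: {green_cost:.4f}s")
```

On typical machines the generator version is orders of magnitude cheaper, which only matters, as noted above, if you start and stop tasks frequently.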

Tudor
  • 61,523
  • 12
  • 102
  • 142
  • I don't understand why it has more overhead. How is it different with "green" threads? You have to keep their stack, so you're wasting the same amount of memory. – Chi-Lan Apr 01 '12 at 14:00
  • 1
    @Chi-Lan: A "green" thread may not be a real thread, but an abstraction of a thread. Several green threads may be intelligently scheduled on the same kernel thread for efficient use, for example using windows fibers to do cooperative scheduling. – Tudor Apr 01 '12 at 14:01
  • @Chi-Lan: "green"/"lightweight" threads are made to avoid this problem. Examples of these are in Haskell, Erlang, and Python. – L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳ Apr 01 '12 at 14:09
  • 2
    @Tudor, Hmmm.. I understand that, but you must explain why. What does an OS thread need to keep that a green thread doesn't? Both must keep the stack of the user thread, both have to be scheduled. Then why is the OS thread expensive? – Chi-Lan Apr 01 '12 at 14:13
  • I downvoted because your answer doesn't answer my question. You said "we need green threads because it saves us OS threads", but I'm asking "why can't we just use OS threads. I don't understand why should OS threads cost more than green threads, both need to save the stack and schedule, yet they cost more. Why is it then?" Edit an explanation in, and I'll remove my downvote. It's a good answer - but not for my question. – Chi-Lan Apr 01 '12 at 14:33
  • @Chi-Lan: Ok, I've included an explanation stating that a user-space call to create a green thread is much faster than a kernel-space call to create a regular thread. I'll look into the issue of scheduling in the meantime. – Tudor Apr 01 '12 at 14:37
  • @Tudor OK, your explanation holds only if you need to start and stop them a lot, but fair point. – Chi-Lan Apr 01 '12 at 14:40
2

A person at Google presented an interesting approach.

According to him, kernel mode switching itself is not the bottleneck; the core cost lies in the SMP scheduler. He also claims that kernel-assisted M:N scheduling wouldn't be expensive, which makes me expect general M:N threading to become available in every language.

eonil
  • 83,476
  • 81
  • 317
  • 516
1

Because of the OS. Imagine that instead of asking you to clean the house, your grandmother has to call the social service, which does some paperwork and a week later assigns a social worker to help her. The worker can be called off at any time and replaced with another one, which again takes several days.

That's pretty ineffective and slow, huh?

In this metaphor you are a userland coroutine scheduler, the social service is an OS with its kernel-level thread scheduler, and a social worker is a fully-fledged thread.

raiks
  • 1,270
  • 1
  • 15
  • 12
0

I think the two things are at different levels.

A thread or process is an instance of a program being executed. A process/thread carries much more with it: an execution stack, open files, signals, processor status, and many other things.

A greenlet is different: it runs in a VM, which supplies a lightweight thread. Many such systems provide pseudo-concurrency (typically in a single, or a few, OS-level threads), and often they provide a lock-free approach via data transmission instead of data sharing.

So the two focus on different things, and hence their weights are different.

And in my mind, the greenlet should be handled in the VM, not the OS.

llj098
  • 1,404
  • 11
  • 13