There are many solutions geared toward implementing "user-space" threads: Go (golang.org) goroutines, Python's green threads, C#'s async, Erlang's processes, and so on. The idea is to allow concurrent programming even with a single thread or a limited number of threads.
It's an abstraction layer. Many people find this concept easier to grasp and can use it more effectively in many scenarios. It's also easier on many machines (assuming a good abstraction), since the model moves from width to pull in many cases. With pthreads (as an example), you have all the control. With higher-level threading models, the idea is to reuse threads, to make creating a concurrent task inexpensive, and to schedule work in a completely different way. This model is far easier to digest; there's less to learn and measure, and the results are generally good.
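For a sense of scale, here's a minimal Go sketch (Go being one of the runtimes named above). The 100,000 count and the squaring work are arbitrary placeholders; the point is that this is routine with userland threads, whereas one OS thread per task would typically reserve megabytes of stack each:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const tasks = 100000
	results := make([]int, tasks)

	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		// Each goroutine starts with a stack of only a few kilobytes;
		// an OS thread typically reserves megabytes.
		go func(n int) {
			defer wg.Done()
			results[n] = n * n // stand-in for real work
		}(i)
	}
	wg.Wait()
	fmt.Println("done:", results[tasks-1])
}
```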
What I don't understand is: why are OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
Creating a thread is expensive, and each thread's stack requires memory. In addition, if your process uses many threads, context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread remained a good solution for medium to large tasks, ideally in low numbers, but that's restrictive and quite time-consuming to maintain.
A task pool, thread pool, or userland thread doesn't need to worry about much of the context switching or thread creation. The model is often "reuse the resource when it becomes available; if it isn't available now, enqueue the task -- and determine the right number of active threads for this machine".
More commonly (IMO), OS-level threads are expensive because they are not used correctly by the engineers -- either there are too many and there's a ton of context switching, there's competition for the same set of resources, or the tasks are too small. It takes much more time to understand how to use OS threads correctly, and how to apply that best to the context of a program's execution.
The OS provides both of these functions for free.
They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time "soon" -- all the process's time is divided among the threads. That's not the common case with user threads: the task is often enqueued until the resource becomes available. This reduces context switching, memory use, and the total number of threads that must be created. When the task exits, the thread is given another.
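Here's a minimal sketch of that queue-and-reuse model in Go (a pool built on pthreads would have the same shape; the queue size of 64 and the 20 tasks are arbitrary):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The queue: tasks wait here when no worker is available,
	// instead of each task getting its own OS thread.
	tasks := make(chan int, 64)

	// Size the pool for this machine ("the number of active threads").
	workers := runtime.NumCPU()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Long-lived worker: when one task exits, it is given another.
			for t := range tasks {
				fmt.Println("finished task", t)
			}
		}()
	}

	for i := 0; i < 20; i++ {
		tasks <- i // enqueue; no thread creation per task
	}
	close(tasks)
	wg.Wait()
}
```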
Consider this analogy of time distribution:
- Assume you are at a casino. There are a number of people who want cards.
- You have a fixed number of dealers. There are fewer dealers than people who want cards.
- There are not always enough cards for every person at any given time.
- People need all cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.
How would you ask the dealers to distribute cards?
Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.
The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.
How would you deal fastest if there were 2 dealers and 3 people? What if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver "on demand". Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (which increases context switching). There are a number of strategies to deal cards more quickly, especially once you eliminate the requirement that people receive cards within a certain amount of time. If the dealer-to-people ratio were 1/50, wouldn't it be faster to go to a table and deal to a person or people until their game is complete? Compare this to visiting every table based on priority, and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is like an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.
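To make that concrete, here's a toy Go simulation of the two dealing strategies. The 500 people and 10 cards per game are made-up numbers, and "switches" is a stand-in for context switches:

```go
package main

import "fmt"

const (
	people       = 500
	cardsPerGame = 10
)

// roundRobin models the OS approach: every pass, each unfinished
// person gets one card, so the dealer switches people on every card.
func roundRobin() int {
	remaining := make([]int, people)
	for i := range remaining {
		remaining[i] = cardsPerGame
	}
	switches, done := 0, 0
	for done < people {
		for i := range remaining {
			if remaining[i] == 0 {
				continue
			}
			remaining[i]-- // deal one card, then move to the next person
			switches++
			if remaining[i] == 0 {
				done++
			}
		}
	}
	return switches
}

// runToCompletion models the pooled approach: a dealer stays at one
// table until that game is finished, then switches once.
func runToCompletion() int {
	switches := 0
	for i := 0; i < people; i++ {
		switches++ // one visit deals the whole game
	}
	return switches
}

func main() {
	fmt.Println("one card per visit (OS-style):", roundRobin())      // 5000
	fmt.Println("whole game per visit (pooled):", runToCompletion()) // 500
}
```

The total number of hand-offs differs by a factor of cardsPerGame, regardless of how many dealers share the work.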
Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
If you developed a performance-critical, low-level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available for users). From that angle, higher-level multithreading models are a simple and obvious optimization based on real-world usage, as well as on the ideal that the bar for adopting and effectively using multithreading can be lowered.
It's not that they are prohibitively expensive -- the lightweight-thread model and pool are a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real-world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these considerations can dramatically reflow a program's execution/implementation. With higher-level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs. pull).
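As a final illustration of "width vs. pull" (my reading of the terms used above -- runWide and runPull are hypothetical names), here are the two shapes side by side in Go:

```go
package main

import (
	"fmt"
	"sync"
)

func doTask(id int) { fmt.Println("task", id) }

// "Width": one concurrent unit per task. With OS threads, this means a
// thread per task -- creation cost and context switching grow with the
// task count.
func runWide(n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			doTask(id)
		}(i)
	}
	wg.Wait()
}

// "Pull": a fixed set of workers pulls tasks from a queue, so the
// degree of concurrency stays constant no matter how many tasks arrive.
func runPull(n, workers int) {
	queue := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range queue {
				doTask(id)
			}
		}()
	}
	for i := 0; i < n; i++ {
		queue <- i
	}
	close(queue)
	wg.Wait()
}

func main() {
	runWide(8)
	runPull(8, 2)
}
```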