This question is about low-level asynchronous I/O system calls such as send combined with epoll, aio_read, and others. I am asking about both network I/O and disk I/O.
The naive way to implement these async calls would be to spawn a thread for each asynchronous I/O request and have that thread perform the request synchronously. Obviously, this naive solution scales badly with a large number of parallel requests. Even if a thread pool were used, we would still need one thread per in-flight request.
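To make the naive approach concrete, here is a minimal user-space sketch of one-thread-per-request; the names (`io_request`, `naive_async_read`) are made up for illustration, and each thread simply performs a blocking read:

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical per-request state; all names here are illustrative. */
struct io_request {
    int    fd;                                   /* descriptor to read from */
    void  *buf;                                  /* caller-supplied buffer */
    size_t len;                                  /* bytes to read */
    void (*done)(struct io_request *, ssize_t);  /* completion callback */
};

/* Thread body: performs the request synchronously, then reports back. */
static void *request_thread(void *arg)
{
    struct io_request *req = arg;
    ssize_t n = read(req->fd, req->buf, req->len); /* blocks this thread */
    req->done(req, n);
    return NULL;
}

/* Naive "async" read: spawn one new thread per request. */
int naive_async_read(struct io_request *req)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, request_thread, req) != 0)
        return -1;
    pthread_detach(tid); /* fire and forget; completion arrives via callback */
    return 0;
}
```

With 10,000 in-flight requests this design needs 10,000 blocked threads, which is exactly the scaling problem described above.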
Therefore, I speculate that this is done in the following more efficient way:
For writing/sending data:
Append the send-request to some kernel-internal async I/O queue.
Dedicated "write-threads" are picking up these send-requests in such a way, that the target hardware is fully utilized. For this, a special I/O scheduler might be used.
Depending on the target hardware, the write-requests get dispatched eventually, e.g. via Direct Memory Access (DMA).
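A minimal user-space model of what I am imagining might look like the following. The queue, the `submit_write`/`write_worker` names, and the plain `write()` standing in for hardware dispatch are all assumptions for illustration; a real kernel uses its own queueing and locking primitives, not pthreads:

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Illustrative model of a kernel-internal send-queue. */
struct write_req {
    int               fd;
    const void       *buf;
    size_t            len;
    struct write_req *next;
};

static struct write_req *wq_head, *wq_tail;
static pthread_mutex_t   wq_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    wq_nonempty = PTHREAD_COND_INITIALIZER;

/* Submission side: append the request and return immediately. */
void submit_write(struct write_req *req)
{
    pthread_mutex_lock(&wq_lock);
    req->next = NULL;
    if (wq_tail) wq_tail->next = req; else wq_head = req;
    wq_tail = req;
    pthread_cond_signal(&wq_nonempty);
    pthread_mutex_unlock(&wq_lock);
}

/* One of a small, fixed pool of "write-threads". An I/O scheduler
 * could reorder the queue here; this sketch just takes requests FIFO. */
void *write_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&wq_lock);
        while (!wq_head)
            pthread_cond_wait(&wq_nonempty, &wq_lock);
        struct write_req *req = wq_head;
        wq_head = req->next;
        if (!wq_head) wq_tail = NULL;
        pthread_mutex_unlock(&wq_lock);
        /* Stand-in for the actual hardware dispatch (e.g. programming DMA). */
        write(req->fd, req->buf, req->len);
    }
    return NULL;
}
```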
For reading/receiving data:
The hardware raises an I/O interrupt, which transfers control to an I/O interrupt handler in the kernel.
The interrupt handler appends a notification to a read-queue and returns quickly.
Dedicated "read-threads" pick up the notifications of the read-queue and perform two tasks: 1) Copy the read data to the target buffer if necessary. 2.) Notify the target process in some way if necessary (e.g.epoll, signals,..).
For all of this, there is no need for more write-threads or read-threads than there are CPU cores, so the scalability problem with many parallel requests would be solved.
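From user space, this is what would make a single thread able to wait on many requests at once. A minimal, simplified epoll example (error handling mostly omitted) of one thread watching many descriptors:

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Block one user thread until any of nfds descriptors becomes readable.
 * However many kernel threads service the devices, this side needs
 * only a single thread for all of them. */
int wait_for_readable(const int *fds, int nfds)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(ep);
            return -1;
        }
    }

    struct epoll_event ready[64];
    int n = epoll_wait(ep, ready, 64, -1); /* sleeps until an event fires */
    for (int i = 0; i < n; i++)
        printf("fd %d is readable\n", ready[i].data.fd);

    close(ep);
    return n;
}
```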
How is this implemented in real OS kernels? Which of these speculations are true?