The revival of asynchronous programming has, to date, had little to do with OS/kernel architecture. OSes have blocked the way you describe since the 1960s, although the reason has changed somewhat.
In early systems, efficiency was measured in useful CPU cycles, so switching to another task when one blocked was the natural thing to do.
Modern systems architecture is more often concerned with keeping all the CPUs occupied: for example, if there are 800 CPUs but only 20 tasks, then 780 CPUs have nothing to do.
As a concrete example, a program to sum all the bytes of a file might look a bit like:
    for (count = 0; (c = getchar()) != EOF; count += c) {}
A multi-threaded version, written for performance, might look like:
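    /* threadfork(), threadexit() and threadwait() are imagined fork()-style thread primitives */
    chunk = file_size / NCPU;
    for (n = 0; n < NCPU; n++) {
        if (threadfork() == 0) {
            unsigned char c;
            off_t o = n * chunk;
            /* the last thread also picks up the remainder of the file */
            off_t l = (n == NCPU - 1) ? file_size - o : chunk;
            /* o++ advances the read offset; without it every read returns the same byte */
            for (count = 0; l-- && pread(fd, &c, 1, o++) == 1; count += c) {}
            threadexit(count);
        }
    }
    for (total = 0, n = 0; n < NCPU; n++) {
        threadwait(&temp);
        total += temp;
    }
    return total;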
which is a bit grim, both because it is complex and because its speed-ups are probably inconsistent.
In comparison, take the function:
    int GetSum(char *data, int len) {
        int count = 0;
        while (len--) {
            /* cast so bytes count as 0..255, matching the getchar() version */
            count += (unsigned char)*data++;
        }
        return count;
    }
With that, I could construct a sort of dispatcher which, whenever a lump of file data became available in RAM, invoked GetSum() on it and queued its return value for later accumulation. Because it may be applicable to many problems, this dispatcher could afford to invest in knowledge of optimal I/O patterns and the like, and the programmer has a considerably simpler job to do.
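To make the dispatcher idea a bit more concrete, here is a minimal sketch of the shape I have in mind. The names dispatch_file, chunk_fn and CHUNK_SIZE are made up for illustration, and GetSum is a variant that takes unsigned bytes and a size_t length. This toy version is sequential: it only shows the split between "the dispatcher owns the I/O" and "the programmer supplies a function over a lump of data". A real dispatcher would presumably overlap the reads with the callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* The only code the application programmer writes: sum one lump of bytes. */
long GetSum(const unsigned char *data, size_t len) {
    long count = 0;
    while (len--) {
        count += *data++;
    }
    return count;
}

/* Hypothetical callback type: consumes one chunk, returns a partial result. */
typedef long (*chunk_fn)(const unsigned char *data, size_t len);

/* Assumed chunk size; a smarter dispatcher would tune this itself. */
enum { CHUNK_SIZE = 64 * 1024 };

/*
 * Hypothetical dispatcher: hands each chunk of fd to fn as it arrives and
 * accumulates the partial results. Returns the total, or -1 on a read error
 * (unambiguous here because a byte sum is never negative).
 */
long dispatch_file(int fd, chunk_fn fn) {
    unsigned char buf[CHUNK_SIZE];
    long total = 0;
    ssize_t got;

    assert(fn != NULL);
    while ((got = read(fd, buf, sizeof buf)) > 0) {
        total += fn(buf, (size_t)got);
    }
    return got < 0 ? -1 : total;
}
```

Calling dispatch_file(fd, GetSum) on an open descriptor returns the byte sum. All of the I/O policy lives in the dispatcher, so it can be swapped out without touching GetSum.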
Even without that sort of native support, I could mmap(2) the file, then dispatch many threads that each just touch a page and then invoke GetSum() on that page. This would effectively emulate an asynchronous model in a plain old unix-y framework.
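A rough sketch of that mmap approach, assuming POSIX threads. The names sum_file_mmap, sum_slice and struct slice are invented here, and GetSum is again the unsigned-byte variant. Rather than one thread per page, each thread gets a page-aligned run of pages and walks it a page at a time, so the first access faults the page in and GetSum then consumes it. Treat it as an illustration, not a tuned implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum one lump of bytes (unsigned, so values are 0..255). */
long GetSum(const unsigned char *data, size_t len) {
    long count = 0;
    while (len--) count += *data++;
    return count;
}

/* Hypothetical work item: one thread's slice of the mapping and its result. */
struct slice {
    const unsigned char *base;
    size_t len;
    size_t page;
    long sum;
};

/* Thread body: touch each page of the slice and feed it to GetSum. */
void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0;
    for (size_t off = 0; off < s->len; off += s->page) {
        size_t n = s->len - off < s->page ? s->len - off : s->page;
        s->sum += GetSum(s->base + off, n);
    }
    return NULL;
}

/* Returns the byte sum of the file at path, or -1 on failure. */
long sum_file_mmap(const char *path, int nthreads) {
    struct stat st;
    long total = -1;
    int started = 0;

    assert(nthreads > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; } /* mmap rejects length 0 */

    size_t size = (size_t)st.st_size;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after close */
    if (map == MAP_FAILED) return -1;

    /* Bytes per thread, rounded up to whole pages so slices stay page aligned. */
    size_t npages = (size + page - 1) / page;
    size_t per = (npages + (size_t)nthreads - 1) / (size_t)nthreads * page;

    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    struct slice *sl = malloc((size_t)nthreads * sizeof *sl);
    if (tid && sl) {
        total = 0;
        for (int i = 0; i < nthreads; i++) {
            size_t start = (size_t)i * per;
            if (start >= size) break; /* more threads than pages */
            sl[i].base = map + start;
            sl[i].len = size - start < per ? size - start : per;
            sl[i].page = page;
            if (pthread_create(&tid[i], NULL, sum_slice, &sl[i]) != 0) {
                total = -1;
                break;
            }
            started++;
        }
        /* The only synchronization point: collect the partial sums. */
        for (int i = 0; i < started; i++) {
            pthread_join(tid[i], NULL);
            if (total >= 0) total += sl[i].sum;
        }
    }
    free(tid);
    free(sl);
    munmap(map, size);
    return total;
}
```

Something like sum_file_mmap("data.bin", 8) would then return the byte sum; on many systems this needs -pthread at build time. The only synchronization is the final join, which is about as close to the async ideal as plain threads get.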
Of course nothing is quite that easy; even a progress bar in a dispatch-oriented asynchronous model is dubious at best (not that the 1950s-based sequential ones were anything to write home about). Communicating errors is also cumbersome, and because you use async to direct maximum CPU resources at yourself, you need to minimize synchronization operations (duh, async :).
Async has a lot of possibilities, but it really needs languages with de facto async support, not an aspirational nod in the latest standard du jour of some rickety language.