I certainly won't be able to answer every conceivable question related to this. I will try to give a summary. Some of what I write here is based on what's documented here and here. My discussion here will also be focused on linux, and docker (not windows, not singularity, not podman, etc.). I'm also not likely to be able to address in detail questions like "why don't other PCI devices have to do this?". I'm also not trying to make my descriptions of how docker works perfectly accurate to an expert in the field.
The NVIDIA GPU driver has components that run in user space and also other components that run in kernel space. These components work together and must be in harmony. This means the kernel mode component(s) for driver XYZ.AB must be used only with user-space components from driver XYZ.AB (not any other version), and vice-versa.
Roughly speaking, docker is a mechanism to provide an isolated user-space linux presence that runs on top of, and interfaces to, the linux kernel (where all the kernel space stuff lives). The linux kernel is in the base machine (outside the container) and much/most of linux user space code is inside the container. This is one of the architectural factors that allow you to do neato things like run an ubuntu container on a RHEL kernel.
From the NVIDIA driver perspective, some of its components need to be installed inside the container and some need to be installed outside the container.
Can I obtain access to my NVIDIA GPUs using the stock Docker (or podman) runtime?
Yes, you can, and this is what people did before nvidia-docker or the nvidia-container-toolkit existed. You need to install the exact same driver in the base machine as well as in the container. Last time I checked, this works (although I don't intend to provide instructions here.) If you do this, the driver components inside the container match those outside the container, and it works.
What am I missing?
NVIDIA (and presumably others) would like a more flexible scenario. The above description means that if a container was built with any other driver version (than the one installed on your base machine) it cannot work. This is inconvenient.
The original purpose of nvidia-docker was to do the following: At container load time, install the runtime components of the driver, which are present in the base machine, into the container. This harmonizes things, and although it does not resolve every compatibility scenario, it resolves a bunch of them. With a simple rule "keep your driver on the base machine updated to the latest" it effectively resolves every compatibility scenario that might arise from a mismatched driver/CUDA runtime. (The CUDA toolkit, and anything that depends on it, like CUDNN, need only be installed in the container.)
As you point out, the nvidia-container-toolkit has picked up a variety of other, presumably useful, functionality over time.
I'm not spending a lot of time here talking about the compatibility strategy ("forward") that exists for compiled CUDA code, and the compatibility strategy ("backward") that exists when talking about a specific driver and the CUDA versions supported by that driver. I'm also not intending to provide instructions for use of the nvidia-container-toolkit, that is already documented, and many questions/answers about it already exist also.
I won't be able to respond to follow up questions like "why was it architected that way?" or "that shouldn't be necessary, why don't you do this?"