Loads/stores don't go directly to memory (unless you use them on an uncacheable memory region). Even NT stores go into a write-combining fill buffer.
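For example, here's a minimal sketch (using SSE2 intrinsics; the function and buffer names are just for illustration) contrasting a normal store, which goes through L1D, with an NT store, which goes into a write-combining buffer instead of the cache:

```cpp
// Sketch: normal store vs. NT (non-temporal) store with SSE2 intrinsics.
// Assumes dst points to at least 32 bytes of 16-byte-aligned storage.
#include <emmintrin.h>   // _mm_store_si128, _mm_stream_si128, _mm_sfence

void store_both_ways(char* dst) {
    __m128i v = _mm_set1_epi32(42);

    // Normal store: the data goes into L1D cache (allocating the line on a miss).
    _mm_store_si128((__m128i*)dst, v);

    // NT store: the data goes into a write-combining fill buffer and is written
    // out to memory when the 64-byte buffer fills (or is flushed), bypassing the
    // cache hierarchy for that line.
    _mm_stream_si128((__m128i*)(dst + 16), v);
    _mm_sfence();   // make the NT store globally visible before later stores
}
```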
Loads/stores move data between the execution units and L1D cache. CPUs internally have wide data paths from L1D to the execution units, and from L1D to the outer cache levels. See How can cache be that fast? on electronics.SE, about Intel IvyBridge.
e.g. IvB has 128-bit data paths between execution units and L1D. Haswell widened that to 256 bits. Unaligned loads/stores have full performance as long as they don't cross a cache-line boundary. Skylake-AVX512 widened it again to 512 bits, so it can do two 64-byte loads and one 64-byte store in a single clock cycle (as long as the data is hot in L1D cache).
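As a hedged illustration of that cache-line-split point (buffer name and offsets are made up; compile with -mavx2): an unaligned 32-byte load is as fast as an aligned one when it stays within one 64-byte line, but a load starting at offset 48 of a line-aligned buffer spans two lines and pays a penalty:

```cpp
// Sketch: which unaligned 32-byte (AVX) loads cross a 64-byte cache-line boundary.
#include <immintrin.h>
#include <cstdio>

alignas(64) static char buf[128];   // starts exactly on a cache-line boundary

int main() {
    // Bytes 0..31: entirely within one cache line -> full speed, same as aligned.
    __m256i a = _mm256_loadu_si256((const __m256i*)(buf + 0));

    // Bytes 48..79: span the line boundary at byte 64 -> a cache-line split,
    // which needs data from two lines and costs extra cycles / L1D accesses.
    __m256i b = _mm256_loadu_si256((const __m256i*)(buf + 48));

    // Use the results so the compiler doesn't optimize the loads away.
    __m256i sum = _mm256_add_epi32(a, b);                       // AVX2
    printf("%d\n", _mm_cvtsi128_si32(_mm256_castsi256_si128(sum)));
    return 0;
}
```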
AMD CPUs up to and including first-generation Ryzen (Zen 1) handle 256-bit vectors in 128-bit halves (even in the execution units, unlike Intel since Core 2, which executes its supported vector width at full width). Older CPUs (e.g. Pentium III and Pentium M) split 128-bit loads/stores (and vector ALU operations) into two 64-bit halves because their load/store execution units were only 64 bits wide.
The memory controllers are DDR2/3/4. The bus is 64 bits wide, but uses a burst mode with a burst size of 64 bytes (not coincidentally, the size of a cache line).
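To make the consequence of that concrete, here's a small sketch (the function name is invented; the only assumption is 64 bytes as the line / burst size): reading just one byte out of every 64 still forces the memory controller to transfer each touched line as a full burst, so it costs roughly the same DRAM bandwidth as reading every byte:

```cpp
// Sketch: DRAM transfers happen in whole 64-byte bursts (one cache line),
// so touching 1 byte per line pulls in the full 64 bytes anyway.
#include <cstddef>
#include <cstdint>

uint64_t touch_one_byte_per_line(const uint8_t* p, size_t bytes) {
    uint64_t sum = 0;
    for (size_t i = 0; i < bytes; i += 64)   // 64 = cache-line / DDR burst size
        sum += p[i];                         // each miss still transfers a whole line
    return sum;
}
```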
Being a "64-bit" CPU is unrelated to the width of any internal or external data buses. That terminology did get used for other CPUs in the past, but even P5 Pentium had a 64-bit data bus. (aligned 8-byte load/store is guaranteed atomic as far back as P5, e.g. x87 or MMX.) 64-bit in this case refers to the width of pointers, and of integer registers.
Further reading:
- David Kanter's Haswell deep dive compares data-path widths in Haswell vs. SnB cores.
- What Every Programmer Should Know About Memory (but note that much of the software-prefetch advice is obsolete; modern CPUs have better HW prefetchers than Pentium 4). Still essential reading, especially if you want to understand how CPUs are connected to DDR2/3/4 memory.
- Other performance links in the x86 tag wiki.
- Enhanced REP MOVSB for memcpy, for more about x86 memory bandwidth. Note especially that single-threaded bandwidth can be limited by max_concurrency / latency rather than by the DRAM controller, especially on a many-core Xeon (higher latency to L3 / memory).