Yes, the pipeline is shared, so it's possible for each the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with number of logical cores, and one thread per core can fairly easily saturated it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses), and low computational intensity (ALU work per load of data into registers. Or into L1d or L2 cache, whichever you're cache-blocking for). e.g. like a dot product.
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).