That's not how CPUs work: a load waiting for an incoming line gets the bytes it needs directly, e.g. on Intel CPUs from the LFB (Line Fill Buffer) that's waiting for that cache line. LFBs are the buffers that track incoming and outgoing cache lines; one is allocated on a cache miss.
The load producing a value happens in parallel with populating the line in L1d. The CPU doesn't have to replay¹ the load uop to make the load execution unit redo the TLB lookup and check permissions again. That strategy would cost extra latency for L2 hits, as well as some load throughput when you had a mix of loads hitting in L1d vs. L2.
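A toy model of the idea (class and method names like `FillBuffer` and `line_arrives` are my own illustrative inventions, not real hardware interfaces): the miss allocates a fill buffer, the waiting load records which buffer it depends on, and when the data arrives the buffer forwards the needed bytes to the load in the same step as installing the line into L1d, with no second lookup.

```python
# Hypothetical sketch of miss handling with a line-fill buffer (LFB / MSHR).
# All names and structures are illustrative, not a real hardware design.

LINE_SIZE = 64

class FillBuffer:
    """Tracks one in-flight cache line and the loads waiting on it."""
    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.waiters = []          # (offset, size, result_slot) tuples

class ToyCache:
    def __init__(self):
        self.l1d = {}              # line_addr -> 64 bytes of data
        self.fill_buffers = {}     # line_addr -> FillBuffer

    def load(self, addr, size):
        line = addr - (addr % LINE_SIZE)
        off = addr % LINE_SIZE
        if line in self.l1d:                      # L1d hit: data right away
            return self.l1d[line][off:off + size]
        fb = self.fill_buffers.setdefault(line, FillBuffer(line))
        slot = {}                                 # load-buffer entry awaiting data
        fb.waiters.append((off, size, slot))
        return slot                               # data shows up here later

    def line_arrives(self, line_addr, data):
        fb = self.fill_buffers.pop(line_addr)
        for off, size, slot in fb.waiters:        # forward bytes to waiting loads...
            slot['data'] = data[off:off + size]
        self.l1d[line_addr] = data                # ...in parallel with filling L1d

cache = ToyCache()
pending = cache.load(0x1008, 4)                   # miss: allocates a fill buffer
cache.line_arrives(0x1000, bytes(range(64)))
print(pending['data'])                            # bytes 8..11 of the line
```

The point of the sketch is that `line_arrives` satisfies the waiting load and fills L1d in one step; nothing re-executes the load against the cache.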
On CPUs where the path between L2 and L1d isn't a full cache line wide (e.g. Intel before Skylake, and many earlier CPUs where the data path might be only 32 bits wide), "early restart" and "critical word first" are also common techniques to reduce L1d miss latency a little:

- **Early restart** lets the load produce a value as soon as the needed byte(s) arrive, instead of waiting for the entire cache line to arrive (or be written to L1d).
- **Critical word first** starts the full-line transfer with the chunk containing the byte(s) wanted by the demand load that missed. (If later loads miss in the same cache line, presumably the first one already initiated the request, so its word will arrive first.)
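A back-of-envelope sketch of the latency effect, under made-up numbers (8-byte bus, one chunk per cycle; none of these constants come from a real CPU):

```python
# Toy latency model for a 64-byte line arriving over an 8-byte bus,
# one chunk per cycle. Constants are illustrative, not from real hardware.

LINE_SIZE, CHUNK = 64, 8
N_CHUNKS = LINE_SIZE // CHUNK           # 8 transfer beats per line

def cycles_until_data(wanted_offset, early_restart, critical_word_first):
    wanted_chunk = wanted_offset // CHUNK
    if critical_word_first:             # transfer starts at the wanted chunk
        order = [(wanted_chunk + i) % N_CHUNKS for i in range(N_CHUNKS)]
    else:                               # transfer starts at chunk 0
        order = list(range(N_CHUNKS))
    if not early_restart:               # must wait for the whole line
        return N_CHUNKS
    return order.index(wanted_chunk) + 1  # cycle the wanted chunk arrives

off = 40                                # demand load wants byte 40 (chunk 5)
print(cycles_until_data(off, False, False))  # 8: wait for the full line
print(cycles_until_data(off, True,  False))  # 6: early restart only
print(cycles_until_data(off, True,  True))   # 1: plus critical word first
```

With both techniques, the wanted byte is usable after the first beat regardless of where it sits in the line; early restart alone only helps for bytes early in the line.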
Even if you did hypothetically design a simplistic CPU that worked the way you were imagining, you'd probably design its performance counters not to count the load replay as another load that hits in cache, because that's not what people want to know about; it's just a bad implementation technique for getting incoming data into the execution pipeline. That's ignoring the fact that a "simplistic" CPU probably wouldn't have performance counters in the first place, so you'd really be talking about cycle-accurate simulation of it.
Footnote 1: On Intel CPUs, when loads take extra time due to cache misses or other effects, uops waiting for the load data might get replayed. Cache-line-split loads, and especially 4k-split loads, might even need another cycle on the load port (e.g. to do a second TLB lookup for the other half, which can come from a different page), but the load uop itself doesn't actually re-dispatch from the RS (scheduler). See also: Are load ops deallocated from the RS when they dispatch, complete or some other time?
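To illustrate what makes split loads special, here's the address arithmetic that classifies them (64-byte lines and 4 KiB pages, as on typical x86; the classification function itself is just for illustration):

```python
# Classify a load by whether it crosses a cache-line or page boundary.
# 64-byte lines and 4 KiB pages, as on typical x86 CPUs.

LINE, PAGE = 64, 4096

def split_kind(addr, size):
    last = addr + size - 1                 # address of the load's last byte
    if addr // PAGE != last // PAGE:
        return "4k-split"                  # second TLB lookup for the other page
    if addr // LINE != last // LINE:
        return "line-split"                # two L1d accesses, but same page
    return "normal"

print(split_kind(0x1000, 8))   # normal
print(split_kind(0x103C, 8))   # line-split: crosses the 0x1040 line boundary
print(split_kind(0x1FFC, 8))   # 4k-split: crosses the 0x2000 page boundary
```

A line-split load touches two cache lines but the same TLB entry; a 4k-split additionally straddles two pages, hence the possible extra cycle for another TLB lookup.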