My interpretation is that, on a TLB miss, the PMH walks the page table and performs stuffed loads into the load buffer. If it finds that accessed or dirty bits need to be set, it communicates an exception code that marks the load so the event is raised when it retires (presumably it also places the virtual address of the load that requires assistance somewhere accessible to the MSROM routine).
The exception is triggered when the load retires: the pipeline is flushed and a special MSROM uop is issued at the allocate stage, which performs the whole walk again (I have no idea why the PMH can't perform stuffed stores itself, but this is the general belief as to what happens). It does seem odd, because it means there has to be a uop that indicates a store to a physical address, and no such uop would be needed if the PMH performed the stuffed stores. The MSROM routine would also have to jump to the page fault exception routine if it encounters an invalid entry or a protection violation. If no dirty / accessed bits need to be set, then it is the PMH that communicates the page fault exception code.
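To make the flow I'm describing concrete, here is a minimal Python sketch of it. This is only a toy model of my interpretation, not of real hardware: the names `Load`, `pmh_walk` and `retire`, the event strings, and the single-level page table held in a dict are all made up for illustration. Only the flag bit positions follow the x86 PTE layout.

```python
from dataclasses import dataclass
from typing import Optional

# PTE flag bits (x86 layout: bit 0 Present, bit 5 Accessed, bit 6 Dirty).
PRESENT, ACCESSED, DIRTY = 1 << 0, 1 << 5, 1 << 6


@dataclass
class Load:
    """Hypothetical load/store buffer entry for this toy model."""
    vaddr: int
    is_store: bool = False
    event: Optional[str] = None  # exception/assist code recorded by the walker


def pmh_walk(page_table: dict, op: Load) -> None:
    """Hardware walk as interpreted above: read-only, it only records an event."""
    pte = page_table.get(op.vaddr >> 12, 0)
    if not pte & PRESENT:
        op.event = "PAGE_FAULT"
    elif not pte & ACCESSED or (op.is_store and not pte & DIRTY):
        op.event = "ASSIST"  # an A/D bit must be set, which the PMH cannot do


def retire(page_table: dict, op: Load) -> str:
    """At retirement the recorded event fires; an assist redoes the walk."""
    if op.event == "ASSIST":
        # Assumed behaviour: pipeline flush, then the microcode routine
        # rewalks the table and writes the A/D bits itself.
        vpn = op.vaddr >> 12
        pte = page_table.get(vpn, 0)
        if not pte & PRESENT:
            return "PAGE_FAULT"  # the entry changed between the two walks
        page_table[vpn] = pte | ACCESSED | (DIRTY if op.is_store else 0)
        return "REPLAY"  # the load is re-executed after the assist
    return op.event or "RETIRED"
```

In this sketch a first load to a present page whose accessed bit is clear gets `"ASSIST"` from `pmh_walk` and `"REPLAY"` from `retire`; a second load to the same page then retires normally because the bit is already set.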
The paper suggests that the load just continues and that the L1d cache controller, instead of returning a dummy value or 0 along with the exception code of the cancelled load, returns the contents of a line-fill buffer, which might still hold data populated by the other logical core (and which can then be used to transiently modify the cache for cache timing attacks).
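As a rough illustration of the difference between the two behaviours, here is a toy Python model. Everything in it is an assumption made for the sketch: `LineFillBuffer`, `transient_value` and `transmit` are hypothetical names, a single buffer entry stands in for the whole LFB, and the cache timing side channel is reduced to a set recording which probe line would have been touched.

```python
LINE_SIZE = 64  # bytes per cache line


class LineFillBuffer:
    """One fill-buffer entry shared by both logical cores (toy model)."""

    def __init__(self) -> None:
        self.data = bytes(LINE_SIZE)

    def fill(self, line: bytes) -> None:
        # Old contents are only overwritten by the next fill, never cleared.
        self.data = line


def transient_value(lfb: LineFillBuffer, offset: int, zero_on_fault: bool) -> int:
    """Value forwarded to dependent uops by a load that will fault or be assisted."""
    if zero_on_fault:
        return 0  # the behaviour I would have expected: a dummy value
    return lfb.data[offset % LINE_SIZE]  # the behaviour the paper describes


def transmit(probe_cached: set, value: int) -> None:
    """Dependent transient access: pulls probe line number `value` into the cache."""
    probe_cached.add(value)


# The sibling thread's load leaves the byte 'S' in the fill buffer.
lfb = LineFillBuffer()
lfb.fill(b"S" + bytes(LINE_SIZE - 1))

# The attacker's faulting load transiently forwards whatever the buffer holds.
leaky = set()
transmit(leaky, transient_value(lfb, 0, zero_on_fault=False))
assert leaky == {ord("S")}

# With a zeroed result, the probe array reveals nothing about the other thread.
safe = set()
transmit(safe, transient_value(lfb, 0, zero_on_fault=True))
assert safe == {0}
```

A real attack would recover the touched probe line by timing reloads rather than by reading a set, but the data dependency is the same.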
Is this forwarding of stale line-fill-buffer data just a silly mistake on Intel's part, an unforeseen side effect?