Suppose I want to write, or even just read and understand, some assembly code and reason about its execution performance on a particular mainstream x86_64 microarchitecture, e.g. Intel Nehalem, AMD K10, Intel Haswell, etc. Today's processors are really complex, with flag stalls, out-of-order execution, dependency chains, execution ports that each handle a different subset of instructions in parallel, etc., and no two microarchitectures run the same code quite the same way.
What simulators or tools can I use to simulate executing some assembly code and see, for a given target microarchitecture, which instructions execute on which clock cycles, with what latency, and on which execution ports, ideally with explanations of why certain things were delayed or reordered? Nice to have, but not required, would be seeing the effects of branch mispredictions, the state of the L1/L2/L3 caches over time, and instruction dependency chains. If there's a way to make the CPU itself run slowly in some sort of profiling mode and report this kind of information in real time, that would also work. I'm especially interested in Intel and AMD platforms, though if nothing exists for those, I'd be interested in other architectures too.