Why isn't RDTSC a serializing instruction?

Question

The Intel manuals for the RDTSC instruction warn that out of order execution can change when RDTSC is actually executed, so they recommend inserting a CPUID instruction in front of it because CPUID will serialize the instruction stream (CPUID is never executed out of order). My question is simple: if they had the ability to make instructions serializing, why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings. Is there a situation under which you would not want to precede it with a serializing instruction?

Newer Intel CPUs have a separate RDTSCP instruction that is serializing. Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want. What is it?
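
For concreteness, here is a minimal C sketch of the `CPUID` + `RDTSC` pairing described above. It assumes an x86-64 target and a GCC- or Clang-compatible compiler (GNU inline assembly); the helper names `tsc_plain` and `tsc_after_cpuid` are made up for illustration. Both instructions sit in one `asm volatile` statement so the compiler cannot separate or reorder them.

```c
#include <stdint.h>

/* Hypothetical helper: plain RDTSC with no ordering guarantees. */
static inline uint64_t tsc_plain(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: CPUID (leaf 0) executes first, so earlier
   instructions have completed before RDTSC reads the counter. */
static inline uint64_t tsc_after_cpuid(void) {
    uint32_t lo = 0, hi;            /* lo doubles as the CPUID leaf in EAX */
    __asm__ __volatile__(
        "cpuid\n\t"
        "rdtsc"
        : "+a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "memory");  /* CPUID also overwrites EBX and ECX */
    return ((uint64_t)hi << 32) | lo;
}
```

Two consecutive calls to `tsc_after_cpuid()` give timestamps whose difference includes the cost of `CPUID` itself, which is part of why the question of overhead comes up.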

Note the question/assertion posed at the end: "..there has to be some situation where a potentially out of order timing is what you want. What is it?" — , Aug 22 '12 at 03:14
**`RDTSCP` isn't serializing** the way `CPUID` is. It's only a one-way barrier for instructions, and [doesn't stop later instructions from executing before it (and other earlier instructions)](http://stackoverflow.com/questions/23280697/is-there-a-cheaper-serializing-instruction-than-cpuid). — Peter Cordes, May 26 '16 at 19:06
IMHO, although it is commonly recognized that `rdtsc` is not a serializing instruction, `rdtsc` instructions do serialize with respect to each other. For back-to-back `rdtsc`s, the later one never executes before the earlier one, which guarantees that any time gap measured with `rdtsc` is always positive. — foool, Jul 12 '22 at 08:29
@foool Is this documented somewhere? Since rdtsc doesn't consume any registers I'd expect that the normal dependency-chain ordering enforcement in the CPU would not apply. I've noticed this is a problem on the compiler level -- compilers will reorder rdtsc even w/ memory barriers b/c there's no dependency chain or memory access. — Joseph Garvin, Jul 16 '22 at 21:36
Actually come to think of it I kinda remember reading somewhere that consensus was this doesn't happen b/c there is no advantage to doing it. But it'd be better if there was a source. — Joseph Garvin, Jul 17 '22 at 00:50

score 13 · Answer 1 · edited Jun 20 '20 at 09:12

The time stamp counter was introduced on the Pentium microarchitecture. Out-of-order execution didn't show up until the Pentium Pro. Intel could have made rdtsc serializing (architecturally or internally), but it seems that they decided to keep it non-serializing, which is OK for general-purpose time measurements, and leave it up to the programmer to add serializing instructions if necessary. This is good for reducing the overhead of the measurement.

That's actually confirmed in the document you provide, with the following comment about Pentium and Pentium/MMX (in 4.2, slightly paraphrased):

> All of the rules and code samples described in section 4.1 (Pentium Pro and Pentium II) also apply to the Pentium and Pentium/MMX. The only difference is, the CPUID instruction is not necessary for serialization.

And, from Wikipedia:

> The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium.
>
> : : :
>
> Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.

One of the two uses of RDTSCP is to give you the processor ID in addition to the time stamp information (it's right there in the name: Read Time-Stamp Counter *AND* Processor ID), which is useful on systems with unsynced TSCs across cores or sockets (See: How to get the CPU cycle count in x86_64 from C++?). The additional ordering properties of rdtscp make it more convenient at the end of the region of interest (See: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?).
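
As a rough illustration of the processor-ID side of `RDTSCP`, the sketch below reads the counter together with the auxiliary value the instruction returns. It assumes GCC or Clang on x86 (`<cpuid.h>` and `<x86intrin.h>`); `has_rdtscp`, `read_tscp`, and `tsc_sample` are hypothetical names. The meaning of the auxiliary value is chosen by the operating system, so treat it as an opaque "which CPU was I on" tag unless your OS documents its layout.

```c
#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid */
#include <x86intrin.h>  /* __rdtscp */

/* Returns 1 if CPUID reports RDTSCP support
   (leaf 0x80000001, EDX bit 27), otherwise 0. */
static int has_rdtscp(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 27) & 1u);
}

/* Hypothetical container for one reading. */
typedef struct {
    uint64_t tsc;   /* time-stamp counter value */
    uint32_t aux;   /* IA32_TSC_AUX contents, set by the OS */
} tsc_sample;

/* Only call this after has_rdtscp() returned 1. */
static tsc_sample read_tscp(void) {
    tsc_sample s;
    unsigned int aux;
    s.tsc = __rdtscp(&aux);
    s.aux = aux;
    return s;
}
```

If two samples carry different `aux` values, the thread probably migrated between them, and on a machine with unsynchronized TSCs their difference should not be trusted.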

I disagree that the document confirms it. Prior to out of order execution, there was no concept of a serializing instruction since instructions were always serial. So when they introduced out of order execution, if they had made RDTSC a serializing instruction there wouldn't have been any observable change in its behavior from earlier processors. — Joseph Garvin, Aug 22 '12 at 03:27
@Joseph, I think you misunderstand what I'm saying it confirms. I'm not stating that what they did was correct, just that the timelines for timestamp counters and OOO execution were confirmed by that document. In fact, I believe what they did was wrong because they regressed the behaviour of RDTSC - it worked on the earlier processor and not on the latter one. I suspect someone didn't take into account OOOE until it was too late but that's just supposition on my part. — paxdiablo, Aug 22 '12 at 04:09
Ah, yes, I agree then, but my goal is to figure out whether it's an error on their part or something deliberate :) — Joseph Garvin, Aug 22 '12 at 04:47
Intel? Make a mistake? Not a chance. As sure as 4195835 divided by 3145727 equals 1.333739068902037589, they're infallible. Foof, I'm stunned that you would think this possible :-) — paxdiablo, Aug 22 '12 at 05:34

score 11 · Accepted Answer · Danny · edited Aug 23 '12 at 19:28

If you are trying to use rdtsc to see if a branch mispredicts, the non-serializing version is what you want.

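    ; math here: leaves its result in ecx and sets ZF
    rdtsc               ; first timestamp -> edx:eax (flags are not modified)
    mov  esi, eax       ; save the low 32 bits; mov leaves the flags alone
    jz   done           ; branch if the math result was zero
    nop                 ; stand-in for work that always takes 1 cycle
done:
    rdtsc               ; second timestamp -> edx:eax
    sub  eax, esi       ; delta between the two reads (low 32 bits)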

If the branch is predicted correctly, the delta will be small (maybe even negative?). If the branch is mispredicted, the delta will be large.

With the serializing version, the branch condition will be resolved because the first rdtsc waits for the math to finish.

edited Aug 23 '12 at 19:28

answered Aug 23 '12 at 11:09

Danny


Very interesting. You mean, assuming the branch isn't taken (since then the second rdtsc wouldn't run since we'd jump somewhere), and we want to check if it not being taken is predicted correctly, the second rdtsc will execute at the same time as the branch check (since the prediction is so the processor can pipeline), otherwise it won't be and the time will be larger. This assumes the CPU never speculatively executes both possibilities, but that was certainly true at the time (and maybe still is?). – Joseph Garvin Aug 23 '12 at 15:48
I changed up the example to make the second rdtsc always execute. – Danny Aug 23 '12 at 19:30
I don't think this is correct because `rdtsc` was not really designed to determine whether a branch was predicted correctly. The technique you described may work, but that's not by design. The purpose of `rdtsc` is to provide a low-overhead, high-resolution method for measuring the time of a region of code. – Hadi Brais Jan 18 '20 at 05:34

score 7 · Answer 3 · answered Aug 22 '12 at 02:54

> why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings

Well, most of the time it's to get high-resolution timestamps. At least some of the time, these timestamps are used for performance metrics. Making the instruction serializing would likely require a pipeline flush, which can be very expensive for CPU-bound applications.
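
To get a feel for that cost, one could compare the smallest gap between two back-to-back reads with and without a serializing `CPUID` in front of each read. This is only a sketch, assuming x86-64 and GCC/Clang inline assembly; `tsc_plain`, `tsc_after_cpuid`, and `min_gap` are illustrative names, and the result is in TSC ticks, which are not necessarily core clock cycles.

```c
#include <stdint.h>

/* Plain RDTSC, free to execute out of order. */
static inline uint64_t tsc_plain(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* CPUID (leaf 0) followed by RDTSC in a single asm statement. */
static inline uint64_t tsc_after_cpuid(void) {
    uint32_t lo = 0, hi;
    __asm__ __volatile__("cpuid\n\trdtsc"
                         : "+a"(lo), "=d"(hi)
                         :
                         : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest difference between two consecutive reads over n tries.
   Taking the minimum filters out interrupts and other noise. */
static uint64_t min_gap(uint64_t (*read_tsc)(void), int n) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        uint64_t t0 = read_tsc();
        uint64_t t1 = read_tsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Comparing `min_gap(tsc_plain, 1000)` with `min_gap(tsc_after_cpuid, 1000)` shows how much each serialized read adds; the exact numbers depend on the CPU and on whether the code runs inside a virtual machine, where `CPUID` can be far more expensive.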

> Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want.

Changing the behavior is almost always undesirable. Intel's customers would be disappointed to find out that RDTSC does something different on newer parts.

Brian Cain

Actually, they'd be used to that. The behaviour changed when switching from Pentium to Pentium Pro - it stopped giving useful results without serialising :-) But you're dead right about it being undesirable. – paxdiablo Aug 22 '12 at 03:13
Making the instruction serializing would require a pipeline flush, but it seems that it's also necessary for your high resolution timestamps to be usable, thus my confusion. The purpose of getting the timestamps is to compare them or get the difference between them -- if you allow the instruction to be pipelined then you're not always measuring the same thing, right? – Joseph Garvin Aug 22 '12 at 03:31
@JosephGarvin: In a pipelined CPU, the time required to execute a piece of code often isn't a clearly-defined number. Flushing the cache before taking measurements will cause the measurements to yield a well-defined number, but that number will have less relationship to real-world performance than would a number measured without the cache flushing. – supercat Nov 02 '14 at 19:27
@JosephGarvin and Brian: A serializing `rdtsc` would not affect the resolution (it would still count at the TSC frequency), but it would increase the overhead of the measurement, which could be significant in some cases compared to the time of the region. – Hadi Brais Jan 18 '20 at 05:40

score 2 · Answer 4 · answered Aug 22 '12 at 17:18

As paxdiablo explains, RDTSC predates the concept of "serializing" instructions because it was implemented on an in-order CPU. Adding that behavior later would change the memory access behavior of code using it, and thus be incompatible for some purposes.

Instead, more recent CPUs have a related RDTSCP instruction that is defined as serializing (actually stronger: it promises to wait until all instructions issued before it have completed, not just that memory accesses have been done), for exactly this reason. Use that if you are running on modern CPUs.

Andy Ross

"Adding that behavior later would change the memory access behavior of code using it, and thus be incompatible for some purposes." Except that I don't think it would. If they had had an out of order CPU before with rdtsc, then yes, making it serializing in later CPUs would be a behavior change. But when they introduced out of order execution, there couldn't be any older programs that depended on rdtsc being serializing because serializing as a concept only exists when you have out of order execution. So my thinking right now is that it was an oversight by Intel. – Joseph Garvin Aug 22 '12 at 19:37
`rdtscp` isn't serializing the way `CPUID` is. It's only a one-way barrier for instructions, and doesn't stop later instructions from passing it and other earlier instructions. – Peter Cordes May 26 '16 at 19:04
_"The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed."_ – geometrian Mar 02 '19 at 18:52
A "serializing instruction" in x86 terminology means it drains the ROB *and* the store buffer, and doesn't let any later instructions execute ahead of it. Like `cpuid`. `rdtscp` is much *weaker* than this, only draining the ROB but not the store buffer. It's like `lfence; rdtsc`, not `lfence;rdtsc;lfence` which you sometimes actually want. You normally wouldn't want it to wait for the store buffer to drain; you can wait for that with `mfence`. – Peter Cordes Jan 18 '20 at 07:23
Update: Serializing also means discarding older instructions fetched, in case of cross-modifying code. [Is there a cheaper serializing instruction than cpuid?](https://stackoverflow.com/a/75456027). Even `mfence ; lfence` wouldn't be safe for that. Anyway, as I said, `rdtscp` doesn't wait for older stores to commit from the store buffer to L1d cache the way `mfence; rdtscp` would. – Peter Cordes Feb 20 '23 at 08:02
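
The two fence patterns mentioned in these comments can be sketched in C as follows. This assumes x86-64, GCC/Clang inline assembly, and that `lfence` waits for earlier instructions to complete before later ones start (as Intel documents); the function names and `example_work` are hypothetical.

```c
#include <stdint.h>

/* lfence; rdtsc: earlier instructions finish before the read, but
   later instructions may still start before the read happens. */
static inline uint64_t tsc_lfence_before(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* lfence; rdtsc; lfence: additionally keeps later instructions
   from starting until the counter has been read. */
static inline uint64_t tsc_lfence_both(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Stand-in for the code being measured. */
static void example_work(void) {
    for (volatile int i = 0; i < 1000; i++) { }
}

/* TSC ticks elapsed while fn() runs, fenced on both sides. */
static uint64_t time_region(void (*fn)(void)) {
    uint64_t start = tsc_lfence_both();
    fn();
    uint64_t stop = tsc_lfence_both();
    return stop - start;
}
```

Neither helper waits for the store buffer to drain, which matches the distinction drawn above between these fences and a truly serializing instruction such as `cpuid`.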
