Questions tagged [micro-architecture]
107 questions
21 votes
3 answers
Does memory dependence speculation prevent BN_consttime_swap from being constant-time?
Context
The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:
#define BN_CONSTTIME_SWAP(ind) \
do { \
t = (a->d[ind] ^ b->d[ind]) & condition; \
…
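The masking idea in the truncated macro above can be illustrated with a standalone sketch. This is not OpenSSL's code: the function name `ct_swap_words` and the use of `uint64_t` in place of `BN_ULONG` are assumptions made for illustration. It only shows why a condition of 0 or all-ones turns an XOR swap into a branch-free conditional swap, and it makes no claim about how a given CPU speculates on the loads and stores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of OpenSSL: swaps a[i] and b[i] for every i
 * when condition is all-ones, and leaves both arrays unchanged when it is 0.
 * The same loads, XORs, and stores are written in the source for both cases. */
static void ct_swap_words(uint64_t condition, uint64_t *a, uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (a[i] ^ b[i]) & condition; /* 0, or the bitwise difference */
        a[i] ^= t; /* unchanged if t == 0, otherwise becomes the old b[i] */
        b[i] ^= t; /* unchanged if t == 0, otherwise becomes the old a[i] */
    }
}
```

Calling it with `condition` set to `~(uint64_t)0` exchanges the two arrays; calling it with 0 performs the same memory accesses but writes back the original values.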

Pascal Cuoq
- 79,187
- 7
- 161
- 281
20 votes
2 answers
What are my available march/mtune options?
Is there a way to get gcc to output the available -march=arch options? I'm getting build errors (tried -march=x86_64) and I don't know what my options are.
The compiler I'm using is a proprietary wrapper around gcc that doesn't seem to like…

Brydon Gibson
- 1,179
- 3
- 11
- 22
14 votes
2 answers
Why isn't there a data bus which is as wide as the cache line size?
When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy (typically 64 bytes on x86_64).
This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems. (since the word size is 8…
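The arithmetic behind the question can be written out as a small sketch. The 64-byte line and 8-byte bus width are the figures quoted in the excerpt; the function name `beats_per_line` is hypothetical, and the sketch ignores everything else about how a real memory controller schedules transfers.

```c
#include <stddef.h>

/* Number of bus transfers ("beats") needed to move one cache line,
 * rounding up if the line size is not a multiple of the bus width. */
static size_t beats_per_line(size_t line_bytes, size_t bus_bytes)
{
    return (line_bytes + bus_bytes - 1) / bus_bytes;
}
```

With the excerpt's numbers, `beats_per_line(64, 8)` evaluates to 8, so one line fill takes eight transfers on such a bus.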

Mike76
- 899
- 1
- 9
- 31
10 votes
1 answer
Any reason to use BX R over MOV pc, R except thumb interwork pre ARMv7?
Linux defines an assembler macro to use BX on CPUs that support it, which makes me suspect there is some performance reason.
This answer and the Cortex-A7 MPCore Technical Reference Manual also state that it helps with branch prediction.
However my…

Timothy Baldwin
- 3,551
- 1
- 14
- 23
10 votes
2 answers
How do the store buffer and Line Fill Buffer interact with each other?
I was reading the MDS attack paper "RIDL: Rogue In-Flight Data Load". They discuss how the Line Fill Buffer can cause leakage of data. There is the question "About the RIDL vulnerabilities and the 'replaying' of loads" that discusses the…

Daniel Näslund
- 2,300
- 3
- 22
- 27
10 votes
1 answer
How are barriers/fences and acquire/release semantics implemented microarchitecturally?
A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as…

Raghu
- 479
- 3
- 13
10 votes
1 answer
Return stack buffer?
As I understand it, the Return Stack Buffer only supports 4 to 16 entries (from wiki: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_function_returns) and is not a set of key-value pairs (that is, it is not indexed by the position of the ret instruction). Is it…
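The structure described in the excerpt, a small stack rather than a table keyed by the address of the ret instruction, can be modelled with a toy sketch. Everything here is an assumption for illustration: the 16-entry size is just the upper end of the range quoted above, the names `rsb_t`, `rsb_on_call`, and `rsb_on_ret` are hypothetical, and real hardware may handle overflow and underflow differently.

```c
#include <stdint.h>

#define RSB_ENTRIES 16 /* assumed size, taken from the 4-to-16 range above */

/* Toy model of a return stack: a circular buffer addressed only by a
 * top-of-stack counter, with no lookup by the address of the ret. */
typedef struct {
    uint64_t slot[RSB_ENTRIES];
    unsigned top; /* counts pushes minus pops; wraps modulo RSB_ENTRIES */
} rsb_t;

/* On a call, record the predicted return address. When more than
 * RSB_ENTRIES calls are outstanding, the oldest entry is overwritten. */
static void rsb_on_call(rsb_t *r, uint64_t return_addr)
{
    r->slot[r->top % RSB_ENTRIES] = return_addr;
    r->top++;
}

/* On a ret, predict the most recently pushed address. */
static uint64_t rsb_on_ret(rsb_t *r)
{
    r->top--;
    return r->slot[r->top % RSB_ENTRIES];
}
```

In this model, nested calls up to 16 deep are predicted correctly on the way back out. A 17th nested call overwrites the oldest slot, so the outermost return gets a stale prediction.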

user683595
- 397
- 1
- 3
- 10
9 votes
0 answers
Are two store buffer entries needed for split line/page stores on recent Intel?
It is generally understood that one store buffer entry is allocated per store, and this store buffer entry holds the store data and physical address¹.
In the case that a store crosses a 4096-byte page boundary, two different translations may be…

BeeOnRope
- 60,350
- 16
- 207
- 386
9 votes
1 answer
Adding a redundant assignment speeds up code when compiled without optimization
I found an interesting phenomenon:
#include <stdio.h>
#include <time.h>
int main() {
int p, q;
clock_t s,e;
s=clock();
for(int i = 1; i < 1000; i++){
for(int j = 1; j < 1000; j++){
for(int k = 1; k < 1000; k++){
…

helloqiu
- 133
- 6
7 votes
0 answers
Under which conditions will the L1 IP-based stride prefetcher be triggered?
Intel's website (Intel hardware Prefetcher) shows that there are four kinds of hardware prefetchers. The prefetcher controlled by bit 3 is the L1 stride prefetcher. I am running test code to find out what the trigger condition of the stride prefetcher is.…

JasperMa
- 71
- 4
7 votes
2 answers
How does the indexing of the Ice Lake's 48KiB L1 data cache work?
The Intel optimization manual (revision September 2019) shows a 48 KiB 8-way associative L1 data cache for the Ice Lake microarchitecture.
1 Software-visible latency/bandwidth will vary depending on access patterns and other factors.
This baffled…
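One way to see why the quoted figures invite a question about indexing is to compute the number of sets they imply. This is a sketch under assumptions: the 64-byte line size is the typical x86_64 value and is not stated in this excerpt, the function names are hypothetical, and the 12-way configuration mentioned afterwards is only a comparison point, not a figure from the excerpt.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sets in a set-associative cache: total size / (ways * line size). */
static size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Simple bit-slice indexing needs a power-of-two number of sets. */
static bool is_power_of_two(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```

For 48 KiB, 8 ways, and 64-byte lines, `cache_sets(48 * 1024, 8, 64)` evaluates to 96, which is not a power of two. For comparison, the same size with 12 ways would give 64 sets.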

Margaret Bloom
- 41,768
- 5
- 78
- 124
7 votes
1 answer
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
First, I have the setup below on an IvyBridge; I will insert the payload code to be measured at the commented location. The first 8 bytes of buf store the address of buf itself, which I use to create a loop-carried dependency:
section .bss
align 64
buf: …

user10865622
- 455
- 3
- 11
6 votes
1 answer
Are any instructions affected by IA32_UARCH_MISC_CTL[DOITM] in existing CPUs?
In the document titled "Data Operand Independent Timing Instruction Set Architecture (ISA) Guidance", Intel is introducing a new IA32_UARCH_MISC_CTL MSR where toggling bit 0 enables the "Data Operand Independent Timing Mode" (DOITM). This MSR is…

amonakov
- 2,324
- 11
- 23
6 votes
1 answer
What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?
With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states.
Two of the items in this taxonomy are:
Short scoreboard - scoreboard dependency on an MIO queue operation.
Long scoreboard -…

einpoklum
- 118,144
- 57
- 340
- 684
6 votes
1 answer
How to tell the length of an x86-64 instruction opcode using the CPU itself?
I know that there are libraries that can "parse" binary machine code / opcode to tell the length of an x86-64 CPU instruction.
But I'm wondering, since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the…

MikeF
- 1,021
- 9
- 29