Highest Voted 'micro-architecture' Questions

21

votes

3 answers

Does memory dependence speculation prevent BN_consttime_swap from being constant-time?

Context: The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1: #define BN_CONSTTIME_SWAP(ind) \ do { \ t = (a->d[ind] ^ b->d[ind]) & condition; \ …

asked Mar 19 '15 at 15:45

Pascal Cuoq

79,187
7
161
281
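
The masking idea in the excerpt above can be sketched as follows. This is a minimal illustration, not OpenSSL's code: the function name `ct_swap_words`, the use of `uint64_t` in place of `BN_ULONG`, and the two XOR lines that complete the swap are assumptions, since the original macro is truncated in the excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not OpenSSL's API): swaps a[i] and b[i] for every i
 * when swap is 1 and leaves both arrays unchanged when swap is 0, without
 * branching on swap. */
static void ct_swap_words(uint64_t *a, uint64_t *b, size_t n, uint64_t swap)
{
    assert(a != NULL && b != NULL);
    /* Turn 0/1 into an all-zeros or all-ones mask, as "condition" is in the excerpt. */
    uint64_t mask = (uint64_t)0 - (swap & 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (a[i] ^ b[i]) & mask; /* 0 when not swapping */
        a[i] ^= t;                         /* assumed completion of the macro */
        b[i] ^= t;
    }
}
```

The same loads, XORs, and stores execute for both values of `swap`, which is what the question means by constant-time at the source level; whether the compiler and the hardware preserve that property is exactly what the question asks.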

20

votes

2 answers

What are my available march/mtune options?

Is there a way to get gcc to output the available -march=arch options? I'm getting build errors (tried -march=x86_64) and I don't know what my options are. The compiler I'm using is a proprietary wrapper around gcc that doesn't seem to like…

gcc command-line x86 compiler-flags micro-architecture

asked Nov 05 '18 at 15:01

Brydon Gibson

1,179
3
11
22

14

votes

2 answers

Why isn't there a data bus which is as wide as the cache line size?

When a cache miss occurs, the CPU fetches a whole cache line (typically 64 bytes on x86_64) from main memory into the cache hierarchy. This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems. (since the word size is 8…

caching memory cpu-architecture cpu-cache micro-architecture

asked Aug 27 '16 at 14:10

Mike76

899
1
9
31

10

votes

1 answer

Any reason to use BX R over MOV pc, R except thumb interwork pre ARMv7?

Linux defines an assembler macro to use BX on CPUs that support it, which makes me suspect there is some performance reason. This answer and the Cortex-A7 MPCore Technical Reference Manual also state that it helps with branch prediction. However, my…

assembly arm cpu-architecture branch-prediction micro-architecture

asked Aug 09 '20 at 00:00

Timothy Baldwin

3,551
1
14
23

10

votes

2 answers

How do the store buffer and Line Fill Buffer interact with each other?

I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is the question "About the RIDL vulnerabilities and the 'replaying' of loads" that discusses the…

x86 cpu-architecture cpu-cache micro-architecture cpu-mds

asked Apr 09 '20 at 20:34

Daniel Näslund

2,300
3
22
27

10

votes

1 answer

How are barriers/fences and acquire, release semantics implemented microarchitecturally?

A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as…

x86 x86-64 cpu-architecture memory-barriers micro-architecture

asked Sep 23 '19 at 21:29

Raghu

479
3
13

10

votes

1 answer

Return stack buffer?

As I understand it, the Return Stack Buffer only supports 4 to 16 entries (from wiki: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_function_returns) and is not a set of key-value pairs (indexed by the position of the ret instruction). Is it…

x86 cpu cpu-architecture branch-prediction micro-architecture

asked Dec 05 '12 at 12:08

user683595

397
1
3
10

9

votes

0 answers

Are two store buffer entries needed for split line/page stores on recent Intel?

It is generally understood that one store buffer entry is allocated per store, and this store buffer entry holds the store data and physical address. In the case that a store crosses a 4096-byte page boundary, two different translations may be…

x86 intel cpu-architecture micro-optimization micro-architecture

asked Apr 13 '20 at 02:44

BeeOnRope

60,350
16
207
386
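
Whether a given store is a split line or split page store in the sense of the question above can be checked with simple address arithmetic. The sketch below assumes 64-byte cache lines and 4096-byte pages; `crosses_boundary`, `LINE_BYTES`, and `PAGE_BYTES` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 64, PAGE_BYTES = 4096 }; /* assumed sizes */

/* Returns true if the access [addr, addr + size) touches more than one
 * aligned block of block_bytes bytes; block_bytes must be a power of two. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t block_bytes)
{
    assert(size > 0 && block_bytes != 0 &&
           (block_bytes & (block_bytes - 1)) == 0);
    /* Compare the block index of the first and last byte accessed. */
    return (addr / block_bytes) != ((addr + size - 1) / block_bytes);
}
```

For example, an 8-byte store at address 4092 crosses both a line and a page boundary, while an 8-byte store at address 60 crosses only a line boundary. This only classifies the access; it says nothing about how many store buffer entries the hardware allocates, which is the open question.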

9

votes

1 answer

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon: #include #include int main() { int p, q; clock_t s,e; s=clock(); for(int i = 1; i < 1000; i++){ for(int j = 1; j < 1000; j++){ for(int k = 1; k < 1000; k++){ …

performance assembly x86 cpu-architecture micro-architecture

asked Mar 09 '18 at 08:41

helloqiu

133
6

7

votes

0 answers

Under which conditions will the L1 IP-based stride prefetcher be triggered?

Intel hardware prefetcher: Intel's website shows that there are four kinds of hardware prefetchers. The prefetcher controlled by bit 3 is the L1 stride prefetcher. I am running test code to find the trigger condition of the stride prefetcher.…

x86 intel cpu-cache prefetch micro-architecture

asked Feb 24 '21 at 02:48

JasperMa

71
4

7

votes

2 answers

How does the indexing of the Ice Lake's 48KiB L1 data cache work?

The Intel optimization manual (revision September 2019) shows a 48 KiB 8-way associative L1 data cache for the Ice Lake microarchitecture. 1 Software-visible latency/bandwidth will vary depending on access patterns and other factors. This baffled…

x86 intel cpu-architecture cpu-cache micro-architecture

asked Jan 19 '20 at 12:25

Margaret Bloom

41,768
5
78
124
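
The arithmetic behind the indexing question above can be worked through in a few lines. This is a sketch under stated assumptions: a 64-byte line size, and a hypothetical helper `cache_sets`. The 12-way figure in the usage note is a comparison point chosen for illustration, not something stated in the excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a set-associative cache:
 * sets = total size / (associativity * line size). */
static size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}
```

With the figures quoted in the question, `cache_sets(48 * 1024, 8, 64)` gives 96 sets, which is not a power of two and so cannot be selected by a plain bit field of the address. A 12-way organization of the same capacity gives `cache_sets(48 * 1024, 12, 64)` = 64 sets, the same count as a 32 KiB 8-way cache.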

7

votes

1 answer

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

First, I have the setup below on an IvyBridge; I will insert the measurement payload code at the commented location. The first 8 bytes of buf store the address of buf itself; I use this to create a loop-carried dependency: section .bss align 64 buf: …

assembly x86 micro-optimization microbenchmark micro-architecture

asked Jan 08 '19 at 03:53

user10865622

455
3
11

6

votes

1 answer

Are any instructions affected by IA32_UARCH_MISC_CTL[DOITM] in existing CPUs?

In the document titled "Data Operand Independent Timing Instruction Set Architecture (ISA) Guidance", Intel is introducing a new IA32_UARCH_MISC_CTL MSR where toggling bit 0 enables the "Data Operand Independent Timing Mode" (DOITM). This MSR is…

x86 cpu-architecture intel micro-architecture

asked May 22 '23 at 19:16

amonakov

2,324
11
23

6

votes

1 answer

What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. Two of the items in this taxonomy are: Short scoreboard - scoreboard dependency on an MIO queue operation. Long scoreboard -…

cuda gpu gpgpu micro-architecture nsight-compute

asked Feb 09 '21 at 17:14

einpoklum

118,144
57
340
684

6

votes

1 answer

How to tell length of an x86-64 instruction opcode using CPU itself?

I know that there are libraries that can "parse" binary machine code / opcodes to tell the length of an x86-64 CPU instruction. But I'm wondering, since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the…

x86 x86-64 cpu-architecture opcode micro-architecture

asked Jul 26 '18 at 19:25

MikeF

1,021
9
29
