Questions tagged [simd]

Single instruction, multiple data (SIMD) is the concept of having each instruction operate on a small chunk or vector of data elements. CPU vector instruction sets include x86 SSE and AVX, ARM NEON, and PowerPC AltiVec. To use SIMD instructions efficiently, data needs to be laid out in structure-of-arrays form and should come in streams long enough to amortize the per-vector overhead. Naively "SIMD optimized" code frequently surprises by running slower than the original.

2540 questions
319
votes
7 answers

Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?

I was reading Agner Fog's optimization manuals, and I came across this example: double data[LEN]; void compute() { const double A = 1.1, B = 2.2, C = 3.3; int i; for(i=0; i…
ttsiodras
  • 10,602
  • 6
  • 55
  • 71
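The code in the excerpt above is cut off. As an illustration of the comparison being asked about, here is a minimal sketch reconstructed under assumptions: direct evaluation of the polynomial A*i*i + B*i + C versus a strength-reduced version that replaces the multiplications with loop-carried additions (forward differences). The function names and the value of LEN are made up for the sketch.

    const int LEN = 1 << 20;          // arbitrary size for the sketch
    double data[LEN];

    // Direct evaluation: each iteration is independent, so the CPU can
    // pipeline and vectorize the multiplications freely.
    void compute_direct() {
        const double A = 1.1, B = 2.2, C = 3.3;
        for (int i = 0; i < LEN; i++)
            data[i] = A * i * i + B * i + C;
    }

    // Strength-reduced: multiplications replaced by loop-carried additions
    // (forward differences). Each iteration now depends on the previous one,
    // so serial add latency, not throughput, limits the loop.
    void compute_reduced() {
        const double A = 1.1, B = 2.2, C = 3.3;
        double y  = C;                // polynomial value at i = 0
        double dy = A + B;            // first forward difference
        const double d2y = 2.0 * A;   // second forward difference (constant)
        for (int i = 0; i < LEN; i++) {
            data[i] = y;
            y  += dy;
            dy += d2y;
        }
    }

The point usually raised in answers to questions like this is that the strength-reduced loop does fewer arithmetic operations but chains them through a serial dependency, which is why it can run slower despite the "cheaper" work.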
317
votes
12 answers

How to compile Tensorflow with SSE4.2 and AVX instructions?

This is the message received from running a script to check if Tensorflow is working: I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:125]…
GabrielChu
  • 6,026
  • 10
  • 27
  • 42
314
votes
9 answers

What is "vectorization"?

Several times now, I've encountered this term in MATLAB, Fortran, and some other languages, but I've never found an explanation of what it means and what it does. So I'm asking here: what is vectorization, and what does it mean, for example, that "a loop…
Thomas Geritzma
  • 6,337
  • 6
  • 25
  • 19
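For orientation, "vectorization" is used in two closely related senses: writing code as whole-array operations (the MATLAB/Fortran sense) and having the compiler map a loop onto SIMD instructions. A small C++ sketch contrasting the loop form with an array-style form (the function names are made up):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Loop form: one element handled per iteration, spelled out explicitly.
    void add_loop(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& out) {
        for (std::size_t i = 0; i < a.size(); ++i)
            out[i] = a[i] + b[i];
    }

    // "Vectorized" (array-style) form: the whole operation is expressed at once,
    // as in MATLAB's  c = a + b  or Fortran array syntax. A compiler may emit
    // SIMD instructions for either version, but the array form states the
    // element-wise parallelism directly instead of through loop indices.
    void add_array(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& out) {
        std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                       std::plus<float>());
    }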
171
votes
5 answers

Header files for x86 SIMD intrinsics

Which header files provide the intrinsics for the different x86 SIMD instruction set extensions (MMX, SSE, AVX, ...)? It seems impossible to find such a list online. Correct me if I'm wrong.
fredoverflow
  • 256,549
  • 94
  • 388
  • 662
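For reference, the conventional header-to-extension mapping on GCC, Clang, and MSVC is sketched below; on current compilers it is usually enough to include only <immintrin.h>, which pulls in the older headers.

    // <mmintrin.h>   MMX
    // <xmmintrin.h>  SSE
    // <emmintrin.h>  SSE2
    // <pmmintrin.h>  SSE3
    // <tmmintrin.h>  SSSE3
    // <smmintrin.h>  SSE4.1
    // <nmmintrin.h>  SSE4.2
    // <wmmintrin.h>  AES / PCLMULQDQ
    // <immintrin.h>  AVX and later (and it includes all of the above)
    #include <immintrin.h>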
108
votes
3 answers

Why is vectorization, in general, faster than loops?

Why, at the lowest level of the hardware performing the operations and the general underlying operations involved (i.e., things common to all programming languages' actual implementations when running code), is vectorization typically so dramatically…
Ben Sandeen
  • 1,403
  • 3
  • 14
  • 17
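Part of the usual answer is that a single SIMD instruction does the work of several scalar ones, so the vectorized loop retires far fewer instructions (adds, index updates, branches) per element. A sketch in C++ with SSE intrinsics, assuming n is a multiple of 4 (the function names are made up):

    #include <xmmintrin.h>  // SSE

    // Scalar: one float added per instruction, one loop iteration per element.
    void add_scalar(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }

    // SIMD: one addps instruction adds 4 floats, and the loop runs n/4 times.
    void add_sse(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {            // assumes n % 4 == 0
            __m128 va = _mm_loadu_ps(a + i);        // unaligned loads for safety
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
    }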
82
votes
5 answers

Fastest way to do horizontal SSE vector sum (or other reduction)

Given a vector of three (or four) floats, what is the fastest way to sum them? Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it? What's the cost of moving to the FPU, then faddp, faddp?…
FeepingCreature
  • 3,648
  • 2
  • 26
  • 25
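One widely used SSE3 idiom for this reduction, shown as a sketch rather than a claim about what is fastest on any particular CPU:

    #include <pmmintrin.h>  // SSE3

    float hsum_ps(__m128 v) {                      // v = [v0, v1, v2, v3]
        __m128 shuf = _mm_movehdup_ps(v);          // [v1, v1, v3, v3]
        __m128 sums = _mm_add_ps(v, shuf);         // [v0+v1, *, v2+v3, *]
        shuf        = _mm_movehl_ps(shuf, sums);   // bring v2+v3 down to lane 0
        sums        = _mm_add_ss(sums, shuf);      // (v0+v1) + (v2+v3) in lane 0
        return _mm_cvtss_f32(sums);
    }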
80
votes
8 answers

Subtracting packed 8-bit integers in a 64-bit integer by 1 in parallel, SWAR without hardware SIMD

I have a 64-bit integer that I'm interpreting as an array of 8 packed 8-bit integers. I need to subtract the constant 1 from each packed integer while handling overflow, without the result of one element affecting the result of…
cam-white
  • 710
  • 5
  • 9
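One well-known SWAR identity that addresses this kind of problem, shown as a sketch (the function name is made up; verify it against your own edge cases): force the high bit of every byte so the borrow from subtracting 1 can never cross a byte boundary, then patch the high bits back afterwards.

    #include <cstdint>

    // Subtract 1 from each of the 8 packed bytes in x, wrapping per byte,
    // without any borrow leaking into the neighbouring byte.
    uint64_t dec_each_byte(uint64_t x) {
        const uint64_t H = 0x8080808080808080ULL;  // high bit of every byte
        const uint64_t L = 0x0101010101010101ULL;  // the value 1 in every byte
        return ((x | H) - L)            // borrow can never cross a byte boundary
               ^ ((x ^ ~L) & H);        // restore the correct high bit per byte
    }

For example, dec_each_byte(0) is 0xFFFFFFFFFFFFFFFF: each zero byte wraps to 255 independently of its neighbours.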
75
votes
1 answer

C# and SIMD: High and low speedups. What is happening?

Introduction of the problem: I am trying to speed up the intersection code of a (2D) ray tracer that I am writing. I am using C# and the System.Numerics library to get the speed of SIMD instructions. The problem is that I am getting strange…
Willem124
  • 751
  • 5
  • 6
68
votes
3 answers

Parallel for vs omp simd: when to use each?

OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be the better choice? EDIT: Here is an interesting paper related to the SIMD directive.
zr.
  • 7,528
  • 11
  • 50
  • 84
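For context, a sketch of what the two directives (and their combination) look like in C++: parallel for distributes iterations across threads, simd asks the compiler to vectorize within a thread, and the combined form does both (the function names are made up).

    void scale_threads(float* a, const float* b, float s, int n) {
        #pragma omp parallel for          // iterations split across threads
        for (int i = 0; i < n; ++i)
            a[i] = s * b[i];
    }

    void scale_simd(float* a, const float* b, float s, int n) {
        #pragma omp simd                  // vectorize within the calling thread
        for (int i = 0; i < n; ++i)
            a[i] = s * b[i];
    }

    void scale_both(float* a, const float* b, float s, int n) {
        #pragma omp parallel for simd     // chunks across threads, SIMD inside each
        for (int i = 0; i < n; ++i)
            a[i] = s * b[i];
    }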
59
votes
2 answers

How to choose AVX compare predicate variants

In the Advanced Vector Extensions (AVX), compare intrinsics like _mm256_cmp_ps take a compare predicate as the last argument. The choices for the predicate overwhelm me. They seem to be a triple of type, ordering, and signaling. E.g. _CMP_LE_OS is…
Bram
  • 7,440
  • 3
  • 52
  • 94
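For orientation, the predicate name decodes as the comparison itself, Ordered vs. Unordered (how NaN operands are treated), and Signaling vs. Quiet (whether quiet NaN operands raise the invalid exception). A usage sketch with AVX (the wrapper name is made up):

    #include <immintrin.h>  // AVX

    // _CMP_LE_OS = less-or-equal, Ordered, Signaling:
    //   Ordered   -> the comparison is false if either operand is NaN;
    //   Signaling -> even quiet NaN operands raise the invalid exception
    //                (the quiet variant of the same test is _CMP_LE_OQ).
    __m256 mask_le(__m256 a, __m256 b) {
        return _mm256_cmp_ps(a, b, _CMP_LE_OS);  // all-ones in lanes where a <= b
    }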
56
votes
5 answers

SSE SSE2 and SSE3 for GNU C++

Is there a simple tutorial for me to get up to speed in SSE, SSE2 and SSE3 in GNU C++? How can you do code optimization in SSE?
yoitsfrancis
  • 4,278
  • 14
  • 44
  • 73
52
votes
5 answers

Where can I find an official reference listing the operation of SSE intrinsic functions?

Is there an official reference listing the operation of the SSE intrinsic functions for GCC, i.e. the functions in the <*mmintrin.h> header files?
NGaffney
  • 1,542
  • 1
  • 15
  • 16
52
votes
5 answers

Getting started with Intel x86 SSE SIMD instructions

I want to learn more about using SSE. What ways are there to learn, besides the obvious one of reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals? Mainly I'm interested in working with the GCC X86 Built-in Functions.
Liran Orevi
  • 4,755
  • 7
  • 47
  • 64
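Besides the Intel manuals, GCC's generic vector extensions are a gentle on-ramp to its x86 built-ins: a type declared with __attribute__((vector_size)) supports ordinary arithmetic operators and is lowered to SSE/AVX instructions when the target allows. A small sketch:

    // GCC/Clang vector extension: 4 packed floats in 16 bytes, i.e. one XMM register.
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf madd(v4sf a, v4sf b, v4sf c) {
        // Element-wise multiply-add; typically compiles to mulps/addps,
        // and with -mfma the compiler may fuse the two operations.
        return a * b + c;
    }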
50
votes
8 answers

How to determine if memory is aligned?

I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this: void sse_func(const float* const ptr, int len){ if( ptr is aligned ) { …
user229898
  • 3,057
  • 3
  • 19
  • 9
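The usual check is arithmetic on the pointer value: cast it to an unsigned integer type and test the low bits against the required alignment. A sketch (the helper name is made up):

    #include <cstdint>

    // True if ptr is aligned to 'alignment' bytes (alignment must be a power of two).
    bool is_aligned(const void* ptr, unsigned alignment) {
        return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
    }

    // e.g. choose the aligned-load path only when the data allows it:
    //   if (is_aligned(ptr, 16)) { /* _mm_load_ps ... */ } else { /* _mm_loadu_ps ... */ }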
48
votes
4 answers

ARM Cortex-A8: What's the difference between VFP and NEON?

In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor. But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use? I read a few links such as…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81