Questions tagged [sse4]

Intel's Streaming SIMD Extensions 4 instruction set for x86 processors.

Intel's Streaming SIMD Extensions 4 instruction set for Intel Core architecture x86 processors and AMD's K10 x86 processors. SSE4.1 introduces 47 new instructions and SSE4.2 a further 7, for 54 in total.

These instructions encompass Intel's SSE4.1 and SSE4.2 instruction sets as well as AMD's SSE4a instruction set. More detailed information on the new instructions can be found in both Intel's and AMD's developer manuals or, more conveniently, on Wikipedia.

55 questions

2 answers

Optimizing code using Intel SSE intrinsics for vectorization

This is my very first time working with SSE intrinsics. I am trying to convert a simple piece of code into a faster version using Intel SSE intrinsics (up to SSE4.2). I seem to encounter a number of errors. The scalar version of the code is: (simple…

c sse sse3 sse4

asked Jun 08 '12 at 16:50

PGOnTheGo

3 answers

SSE multiplication 16 x uint8_t

I want to multiply with SSE4 a __m128i object with 16 unsigned 8 bit integers, but I could only find an intrinsic for multiplying 16 bit integers. Is there nothing such as _mm_mult_epi8?
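As a rough illustration of what this question is after: there is no 8-bit multiply intrinsic, but one common workaround is to do two 16-bit multiplies (even bytes and odd bytes) and merge the low bytes of the products. The sketch below uses only SSE2 intrinsics and keeps the low 8 bits of each product; the names `mul_epu8_lo` and `mul_u8x16` are hypothetical helpers, not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: multiply 16 unsigned 8-bit lanes, keeping the
   low 8 bits of each product (i.e. the result modulo 256). */
static inline __m128i mul_epu8_lo(__m128i a, __m128i b)
{
    /* Even-indexed bytes: the low byte of each 16-bit product depends
       only on the low bytes of the two 16-bit inputs. */
    __m128i even = _mm_mullo_epi16(a, b);
    /* Odd-indexed bytes: move them into the low byte of each 16-bit lane
       first, then multiply. */
    __m128i odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    /* Put the odd products back in the high bytes and merge with the
       masked even products. */
    return _mm_or_si128(_mm_slli_epi16(odd, 8),
                        _mm_and_si128(even, _mm_set1_epi16(0x00FF)));
}

/* Array wrapper (unaligned loads and stores), also hypothetical. */
void mul_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul_epu8_lo(va, vb));
}
```

If the full 16-bit products are needed instead, the inputs would have to be widened to 16-bit lanes first, which produces two result vectors rather than one.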

x86 sse simd sse4

asked Nov 19 '11 at 11:03

Roby

2 answers

Generate code for multiple SIMD architectures

I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags,…

gcc simd avx sse4

asked Jun 10 '17 at 23:35

Jens Munk

2 answers

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the movdqu instruction accessing data across a page boundary (OS = Linux). If the following page is mapped, this works just fine. If it's not mapped then I get a SIGSEGV, which is…

linux sse4

asked Feb 11 '14 at 22:49

user3299291

1 answer

What's the difference between __popcnt() and _mm_popcnt_u32()?

MS Visual C++ supports two flavors of the popcnt instruction on CPUs with SSE4.2: __popcnt() and _mm_popcnt_u32(). The only difference I found was that the docs for __popcnt() are marked as "Microsoft Specific", and _mm_popcnt_u32() seems to be an…

x86 sse intrinsics sse4

asked Jun 20 '12 at 06:32

Adi Shavit

1 answer

Can PTEST be used to test if two registers are both zero or some other condition?

What can you do with SSE4.1 ptest other than testing if a single register is all-zero? Can you use a combination of ZF and CF to test anything useful about two unknown input registers? What is PTEST good for? You'd think it would be good for…

assembly x86 sse intrinsics sse4

asked Apr 30 '17 at 23:03

Peter Cordes

3 answers

Optimal SSE unsigned 8 bit compare

I'm trying to find the most efficient way of performing 8 bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g. _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1 (which of course can also…
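For context, `_mm_cmpgt_epu8` in the excerpt above is not an existing SSE intrinsic; only the signed `_mm_cmpgt_epi8` exists. One well-known workaround, sketched here with SSE2 only, is to flip the sign bit of both operands so that the signed compare orders them as if they were unsigned. The names `cmpgt_epu8_sketch` and `cmpgt_u8x16` are hypothetical, and this is not necessarily the fastest option for the special case of comparing against zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-byte unsigned a > b, returning 0xFF where true
   and 0x00 where false. */
static inline __m128i cmpgt_epu8_sketch(__m128i a, __m128i b)
{
    /* XOR with 0x80 maps unsigned 0..255 onto signed -128..127 while
       preserving order, so the signed compare gives the unsigned answer. */
    const __m128i bias = _mm_set1_epi8(-128);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Array wrapper (unaligned loads and stores), also hypothetical. */
void cmpgt_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *mask)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)mask, cmpgt_epu8_sketch(va, vb));
}
```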

c x86 sse simd sse4

asked Nov 20 '15 at 10:26

Paul R

2 answers

_mm_crc32_u64 poorly defined

Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32 not CRC64). …

c sse crc crc32 sse4

asked Apr 01 '13 at 22:07

David I. McIntosh

3 answers

How to enable support for the POPCNT instruction / intrinsic on my computer?

I tried to run the following program in my computer (Fedora 17 32bit). How can I enable my system to support the popcnt instruction for fast population count? #include #include int main(void) { int pop =…

c gcc x86 sse4 population-count

asked Nov 11 '12 at 15:05

afancy

1 answer

Make a Dockerfile that compiles a Tensorflow binary to use: SSE4.1, SSE4.2 and AVX instructions

So, one of the purposes of Docker is to easily deploy an environment to test software, right? Can anybody tell me how to compile a Tensorflow binary to use SSE4.1 and SSE4.2 in a Dockerfile? Can anybody point me to a Dockerfile that does that? if it…

docker tensorflow cpu sse4

asked Jan 29 '18 at 15:44

Diego Orellana

1 answer

Simulating packusdw functionality with SSE2

I'm implementing a fast x888 -> 565 pixel conversion function in pixman according to the algorithm described by Intel [pdf]. Their code converts x888 -> 555 while I want to convert to 565. Unfortunately, converting to 565 means that the high bit is…

x86 sse intrinsics sse2 sse4

asked Jun 13 '12 at 23:14

mattst88

1 answer

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

I'm experimenting with SSE4.2 and STTNI instructions and got a strange result: PcmpEstrM (works with explicit length strings) runs twice as slow as PcmpIstrM (implicit length strings). On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms…

c++ performance sse sse4

asked Jan 05 '14 at 16:07

Xtra Coder

1 answer

What is the fastest way to do a SIMD gather without AVX(2)?

Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers): a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 Into four vectors a, b, c, d? a: {a0, a1, a2, a3} b: {b0, b1, b2,…
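To make the layout in this question concrete: with four rows of four 32-bit integers, the "gather" amounts to a 4x4 transpose, which can be done with unpack instructions alone. The sketch below is one possible approach using only SSE2 intrinsics, not necessarily the fastest; the function name `deinterleave4x4` and its parameters are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: src holds a0 b0 c0 d0 a1 b1 c1 d1 ... a3 b3 c3 d3.
   Writes {a0..a3}, {b0..b3}, {c0..c3}, {d0..d3} to the four outputs. */
void deinterleave4x4(const int32_t *src,
                     int32_t *a, int32_t *b, int32_t *c, int32_t *d)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *)(src + 0));   /* a0 b0 c0 d0 */
    __m128i r1 = _mm_loadu_si128((const __m128i *)(src + 4));   /* a1 b1 c1 d1 */
    __m128i r2 = _mm_loadu_si128((const __m128i *)(src + 8));   /* a2 b2 c2 d2 */
    __m128i r3 = _mm_loadu_si128((const __m128i *)(src + 12));  /* a3 b3 c3 d3 */

    /* Interleave 32-bit elements of row pairs. */
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);  /* a2 a3 b2 b3 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);  /* c0 c1 d0 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  /* c2 c3 d2 d3 */

    /* Interleave 64-bit halves to finish the transpose. */
    _mm_storeu_si128((__m128i *)a, _mm_unpacklo_epi64(t0, t1));  /* a0 a1 a2 a3 */
    _mm_storeu_si128((__m128i *)b, _mm_unpackhi_epi64(t0, t1));  /* b0 b1 b2 b3 */
    _mm_storeu_si128((__m128i *)c, _mm_unpacklo_epi64(t2, t3));  /* c0 c1 c2 c3 */
    _mm_storeu_si128((__m128i *)d, _mm_unpackhi_epi64(t2, t3));  /* d0 d1 d2 d3 */
}
```

The whole transpose costs four loads and eight unpacks, with no scalar inserts or extracts.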

x86 sse simd sse4

asked Oct 24 '13 at 05:34

orlp

1 answer

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly. I'm writing software that operates on bytes and words…

c byte xeon-phi sse4 avx512

asked Jun 08 '16 at 21:56

user1649948

1 answer

How to compare more than two numbers in parallel?

Is it possible to compare more than a pair of numbers in one instruction using SSE4? Intel Reference says the following about PCMPGTQ PCMPGTQ — Compare Packed Data for Greater Than Performs an SIMD compare for the packed quadwords in the…

c algorithm parallel-processing sse sse4

asked Sep 24 '12 at 04:07

Lazer
