Questions tagged [sse4]

Intel's Streaming SIMD Extensions 4 instruction set for x86 processors.

Intel's Streaming SIMD Extensions 4 instruction set for Intel Core architecture x86 processors and AMD's K10 x86 processors. Intel's SSE4 adds 54 instructions in total: 47 in SSE4.1 and 7 in SSE4.2.

The tag covers Intel's SSE4.1 and SSE4.2 instruction sets as well as AMD's SSE4a instruction set. More detailed information on the new instructions can be found in both Intel's and AMD's developer manuals or, more conveniently, on Wikipedia.

55 questions
14
votes
2 answers

Optimizing code using Intel SSE intrinsics for vectorization

This is my very first time working with SSE intrinsics. I am trying to convert a simple piece of code into a faster version using Intel SSE intrinsics (up to SSE4.2). I seem to encounter a number of errors. The scalar version of the code is: (simple…
PGOnTheGo
  • 805
  • 1
  • 11
  • 25
13
votes
3 answers

SSE multiplication 16 x uint8_t

I want to use SSE4 to multiply a __m128i object holding 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing like _mm_mult_epi8?
Roby
  • 2,011
  • 4
  • 28
  • 55
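
For readers skimming this entry: SSE has no packed 8-bit multiply, so a common workaround is two 16-bit multiplies. The following is a minimal sketch, assuming that only the low 8 bits of each product are wanted. The helper names are hypothetical and it needs nothing beyond SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-byte product, modulo 256, of 16 unsigned bytes. */
static __m128i mullo_epu8_sketch(__m128i a, __m128i b)
{
    /* Even-indexed bytes: the low byte of each 16-bit product depends
       only on the low bytes of the two inputs. */
    __m128i even = _mm_mullo_epi16(a, b);
    /* Odd-indexed bytes: shift them down to the low half of each lane. */
    __m128i odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    /* Keep the low byte of the even products, move the odd ones back up. */
    return _mm_or_si128(_mm_and_si128(even, _mm_set1_epi16(0x00FF)),
                        _mm_slli_epi16(odd, 8));
}

/* Hypothetical check: returns 1 if the vector result equals a scalar loop. */
static int mullo_epu8_matches_scalar(const uint8_t x[16], const uint8_t y[16])
{
    uint8_t out[16];
    __m128i r = mullo_epu8_sketch(_mm_loadu_si128((const __m128i *)x),
                                  _mm_loadu_si128((const __m128i *)y));
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; ++i)
        if (out[i] != (uint8_t)(x[i] * y[i]))
            return 0;
    return 1;
}
```

If the full 16-bit products are needed instead, the inputs would have to be widened first (for example with _mm_unpacklo_epi8 and _mm_unpackhi_epi8 against zero), which this sketch does not do.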
13
votes
2 answers

Generate code for multiple SIMD architectures

I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags,…
Jens Munk
  • 4,627
  • 1
  • 25
  • 40
12
votes
2 answers

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the movdqu instruction accessing data across a page boundary (OS = Linux). If the following page is mapped, this works just fine. If it's not mapped then I get a SIGSEGV, which is…
user3299291
  • 121
  • 3
11
votes
1 answer

What's the difference between __popcnt() and _mm_popcnt_u32()?

MS Visual C++ supports 2 flavors of the popcnt instruction on CPUs with SSE4.2: __popcnt() and _mm_popcnt_u32(). The only difference I found was that the docs for __popcnt() are marked as "Microsoft Specific", and _mm_popcnt_u32() seems to be an…
Adi Shavit
  • 16,743
  • 5
  • 67
  • 137
9
votes
1 answer

Can PTEST be used to test if two registers are both zero or some other condition?

What can you do with SSE4.1 ptest other than testing if a single register is all-zero? Can you use a combination of ZF and CF to test anything useful about two unknown input registers? What is PTEST good for? You'd think it would be good for…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
9
votes
3 answers

Optimal SSE unsigned 8 bit compare

I'm trying to find the most efficient way of performing 8-bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g. _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1 (which of course can also…
Paul R
  • 208,748
  • 37
  • 389
  • 560
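
SSE provides only a signed byte greater-than compare (_mm_cmpgt_epi8), so the _mm_cmpgt_epu8 written in the excerpt is not a real intrinsic. Below is a minimal sketch of two commonly used SSE2 workarounds. The function names are hypothetical, and no claim is made that either is the fastest option on a given CPU.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a > b per byte. Flipping the top bit of
   both inputs maps unsigned order onto signed order. */
static __m128i cmpgt_epu8_sketch(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8(-128);  /* 0x80 in every byte */
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Hypothetical helper for the special case v > 0U: a byte is greater than
   zero exactly when it is not equal to zero. */
static __m128i cmpgt_zero_epu8_sketch(__m128i v)
{
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_xor_si128(is_zero, _mm_set1_epi8(-1));  /* bitwise NOT */
}

/* Test helpers: one result bit per byte, bit i for byte i. */
static int cmpgt_epu8_mask(const uint8_t a[16], const uint8_t b[16])
{
    return _mm_movemask_epi8(cmpgt_epu8_sketch(
        _mm_loadu_si128((const __m128i *)a),
        _mm_loadu_si128((const __m128i *)b)));
}

static int cmpgt_zero_epu8_mask(const uint8_t v[16])
{
    return _mm_movemask_epi8(
        cmpgt_zero_epu8_sketch(_mm_loadu_si128((const __m128i *)v)));
}
```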
9
votes
2 answers

_mm_crc32_u64 poorly defined

Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). …
David I. McIntosh
  • 2,038
  • 4
  • 23
  • 45
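
To illustrate the point about the 32-bit accumulator, here is a minimal sketch assuming GCC or Clang on x86-64 and a CPU with SSE4.2. The SSE42_TARGET macro and the function names are hypothetical. The byte-wise loop uses the usual CRC-32C convention (initial value and final XOR of 0xFFFFFFFF), which is a choice made here rather than something stated in the question.

```c
#include <assert.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang per-function target attribute, so the file builds
   without -msse4.2. Other compilers may need a command-line switch. */
#if defined(__GNUC__)
#define SSE42_TARGET __attribute__((target("sse4.2")))
#else
#define SSE42_TARGET
#endif

/* Hypothetical helper: CRC-32C over a buffer, one byte per step. */
SSE42_TARGET
static uint32_t crc32c_bytes_sketch(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical wrapper around the 64-bit form (x86-64 only). The return
   type is 64 bits wide, but only the low 32 bits carry the CRC. */
SSE42_TARGET
static uint64_t crc32_u64_raw_sketch(uint64_t crc, uint64_t v)
{
    return _mm_crc32_u64(crc, v);
}
```

A caller can therefore narrow the result of the 64-bit form to uint32_t without losing information.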
9
votes
3 answers

How to enable support for the POPCNT instruction / intrinsic on my computer?

I tried to run the following program on my computer (Fedora 17 32bit). How can I enable my system to support the popcnt instruction for fast population count? #include #include int main(void) { int pop =…
afancy
  • 673
  • 4
  • 10
  • 18
8
votes
1 answer

Make a Dockerfile that compiles a Tensorflow binary to use: SSE4.1, SSE4.2 and AVX instructions

So, one of the purposes of Docker is to easily deploy an environment to test software, right? Can anybody tell me how to compile a Tensorflow binary to use SSE4.1 and SSE4.2 in a Dockerfile? Can anybody point me to a Dockerfile that does that? if it…
Diego Orellana
  • 994
  • 1
  • 9
  • 20
8
votes
1 answer

Simulating packusdw functionality with SSE2

I'm implementing a fast x888 -> 565 pixel conversion function in pixman according to the algorithm described by Intel [pdf]. Their code converts x888 -> 555 while I want to convert to 565. Unfortunately, converting to 565 means that the high bit is…
mattst88
  • 1,462
  • 13
  • 21
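
A widely used way to approximate packusdw (SSE4.1 _mm_packus_epi32) with SSE2 only is to bias the inputs so that the signed pack can be used. This is a minimal sketch with hypothetical function names, not the pixman code the question refers to. One assumption to note: it matches the unsigned saturating pack only for inputs of at least INT32_MIN + 32768, because the subtraction wraps below that.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack eight signed 32-bit values into eight unsigned
   16-bit values with saturation to [0, 65535], using SSE2 only. */
static __m128i packus_epi32_sse2_sketch(__m128i lo, __m128i hi)
{
    const __m128i bias32 = _mm_set1_epi32(32768);
    const __m128i bias16 = _mm_set1_epi16((short)-32768);  /* 0x8000 */
    /* Shift the unsigned range [0, 65535] onto the signed 16-bit range. */
    lo = _mm_sub_epi32(lo, bias32);
    hi = _mm_sub_epi32(hi, bias32);
    /* Signed saturating pack clamps to [-32768, 32767]. */
    __m128i packed = _mm_packs_epi32(lo, hi);
    /* Adding 0x8000 modulo 2^16 undoes the bias. */
    return _mm_add_epi16(packed, bias16);
}

/* Test helper: runs the sketch on plain arrays. */
static void packus_epi32_sse2_demo(const int32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));
    _mm_storeu_si128((__m128i *)out, packus_epi32_sse2_sketch(lo, hi));
}
```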
7
votes
1 answer

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

I'm experimenting with SSE42 and STTNI instructions and got a strange result: PcmpEstrM (works with explicit length strings) runs twice as slow as PcmpIstrM (implicit length strings). On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms…
Xtra Coder
  • 3,389
  • 4
  • 40
  • 59
7
votes
1 answer

What is the fastest way to do a SIMD gather without AVX(2)?

Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers): a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 Into four vectors a, b, c, d? a: {a0, a1, a2, a3} b: {b0, b1, b2,…
orlp
  • 112,504
  • 36
  • 218
  • 315
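
One way to read the layout in this excerpt is as a 4x4 matrix of 32-bit integers stored row by row, in which case the four target vectors are its columns. Here is a minimal sketch of that transpose using SSE2 unpack instructions, assuming 16 consecutive int32_t values in memory. The function name is hypothetical, and this is not necessarily the fastest approach the question is asking for.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: src holds a0 b0 c0 d0 a1 b1 c1 d1 ... a3 b3 c3 d3.
   On return, *a = {a0,a1,a2,a3}, *b = {b0,...}, and so on. */
static void deinterleave4x4_epi32_sketch(const int32_t *src, __m128i *a,
                                         __m128i *b, __m128i *c, __m128i *d)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *)(src + 0));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(src + 4));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(src + 8));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(src + 12));

    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpackhi_epi32(r0, r1);  /* c0 c1 d0 d1 */
    __m128i t2 = _mm_unpacklo_epi32(r2, r3);  /* a2 a3 b2 b3 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  /* c2 c3 d2 d3 */

    *a = _mm_unpacklo_epi64(t0, t2);          /* a0 a1 a2 a3 */
    *b = _mm_unpackhi_epi64(t0, t2);          /* b0 b1 b2 b3 */
    *c = _mm_unpacklo_epi64(t1, t3);          /* c0 c1 c2 c3 */
    *d = _mm_unpackhi_epi64(t1, t3);          /* d0 d1 d2 d3 */
}

/* Test helper: writes the four result vectors back to a flat array. */
static void deinterleave4x4_demo(const int32_t src[16], int32_t out[16])
{
    __m128i a, b, c, d;
    deinterleave4x4_epi32_sketch(src, &a, &b, &c, &d);
    _mm_storeu_si128((__m128i *)(out + 0), a);
    _mm_storeu_si128((__m128i *)(out + 4), b);
    _mm_storeu_si128((__m128i *)(out + 8), c);
    _mm_storeu_si128((__m128i *)(out + 12), d);
}
```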
6
votes
1 answer

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly. I'm writing software that operates on bytes and words…
user1649948
  • 651
  • 4
  • 12
6
votes
1 answer

How to compare more than two numbers in parallel?

Is it possible to compare more than a pair of numbers in one instruction using SSE4? Intel Reference says the following about PCMPGTQ PCMPGTQ — Compare Packed Data for Greater Than Performs an SIMD compare for the packed quadwords in the…
Lazer
  • 90,700
  • 113
  • 281
  • 364