Questions tagged [sse3]

SSE3, Streaming Single Instruction Multiple Data Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. In April 2005, AMD introduced a subset of SSE3 in revision E (Venice and San Diego) of their Athlon 64 CPUs.

28 questions
15
votes
3 answers

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this: s=0; for (i=0; i<32; i++) { s = s + a[i]; } However, it's taking too much time, since my application is a real-time application requiring…
gpuguy
  • 4,607
  • 17
  • 67
  • 125
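
For readers of the excerpt above, here is a minimal sketch of one common SSE2 approach to summing 32 unsigned bytes without overflow: `_mm_sad_epu8` (PSADBW) against a zero vector adds each group of 8 bytes into a 64-bit lane. The function name `sum_u8x32` and the `HAVE_SSE2` macro are hypothetical, and this is not necessarily what the answers to the question recommend. A scalar fallback is used when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1    /* hypothetical macro, used only in this sketch */
#endif

/* Sum 32 unsigned bytes. The largest possible result is 32 * 255 = 8160,
 * so the sum cannot overflow the return type. */
static uint32_t sum_u8x32(const uint8_t *a)
{
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_loadu_si128((const __m128i *)a);
    __m128i hi = _mm_loadu_si128((const __m128i *)(a + 16));
    /* PSADBW against zero: each 64-bit lane receives the sum of its 8 bytes. */
    __m128i s = _mm_add_epi64(_mm_sad_epu8(lo, zero), _mm_sad_epu8(hi, zero));
    /* Add the upper 64-bit lane onto the lower one. */
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (uint32_t)_mm_cvtsi128_si32(s);
#else
    /* Scalar fallback, equivalent to the loop in the question. */
    uint32_t s = 0;
    for (size_t i = 0; i < 32; i++)
        s += a[i];
    return s;
#endif
}
```

Calling `sum_u8x32(a)` on a 32-byte array returns the same value as the scalar loop; the unaligned loads mean `a` does not need 16-byte alignment.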
14
votes
2 answers

Optimizing code using Intel SSE intrinsics for vectorization

This is my very first time working with SSE intrinsics. I am trying to convert a simple piece of code into a faster version using Intel SSE intrinsics (up to SSE4.2). I seem to encounter a number of errors. The scalar version of the code is: (simple…
PGOnTheGo
  • 805
  • 1
  • 11
  • 25
12
votes
3 answers

SSE instruction set not enabled

I am having trouble with this error: "SSE instruction set not enabled". How can I figure this out? I have an Acer i7 with Ubuntu 11.10. Can anyone please help me? Any help will be appreciated! Also running: sudo cat /proc/cpuinfo | grep…
ksolid
  • 151
  • 1
  • 2
  • 5
12
votes
1 answer

How does _mm_mwait work?

How does _mm_mwait from pmmintrin.h work? (I mean not the asm for it, but the action it performs and how that action is carried out in NUMA systems. Store monitoring is easy to implement only on bus-based SMP systems with bus snooping.) What processors does…
osgx
  • 90,338
  • 53
  • 357
  • 513
9
votes
1 answer

(Vec4 x Mat4x4) product using SIMD and improvements

I am writing a complex simulation program and it appears that the most time-consuming routine is the one for multiplying a four-vector (float4) with a 4x4 matrix. I need to run this program on several computers, which are more or less old. That is…
Asohan
  • 101
  • 1
  • 7
5
votes
1 answer

What is the difference between _mm_movehdup_ps and _mm_shuffle_ps in this case?

If my understanding is correct, _mm_movehdup_ps(a) gives the same result as _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 3, 3))? Is there a performance difference between the two?
ThreeStarProgrammer57
  • 2,906
  • 2
  • 16
  • 24
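
One way to check the equivalence asked about above is to run both intrinsics on the same input and compare the results. The sketch below assumes GCC or Clang on x86; the helper `movehdup_matches` and the `SSE3_FN` macro are hypothetical names introduced here. `_MM_SHUFFLE` lists its selectors from the highest element to the lowest, so the order of the arguments matters: with this ordering, `_MM_SHUFFLE(3, 3, 1, 1)` yields elements 1, 1, 3, 3 from low to high, which is what `_mm_movehdup_ps` produces.

```c
#include <string.h>
#include <pmmintrin.h> /* SSE3: _mm_movehdup_ps */

/* Assumption: GCC/Clang. Enable SSE3 for this function only when the
 * translation unit was not already compiled with -msse3. */
#if defined(__GNUC__) && !defined(__SSE3__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* Returns 1 if _mm_movehdup_ps(a) equals the chosen shuffle of a, else 0.
 * use_3311 != 0 selects _MM_SHUFFLE(3, 3, 1, 1);
 * use_3311 == 0 selects _MM_SHUFFLE(1, 1, 3, 3), as written in the question. */
SSE3_FN static int movehdup_matches(const float in[4], int use_3311)
{
    __m128 a = _mm_loadu_ps(in);
    __m128 dup = _mm_movehdup_ps(a);
    __m128 shuf = use_3311
        ? _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1))
        : _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 3, 3));
    float x[4], y[4];
    _mm_storeu_ps(x, dup);
    _mm_storeu_ps(y, shuf);
    return memcmp(x, y, sizeof x) == 0;
}
```

For an input with four distinct values, the `(3, 3, 1, 1)` form matches `_mm_movehdup_ps`, while the `(1, 1, 3, 3)` form swaps the two halves and does not. The sketch says nothing about relative performance, which depends on the microarchitecture.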
5
votes
2 answers

Performance degrade while using alternative for Intel intrinsics SSSE3

I am developing a performance-critical application which has to be ported to an Intel Atom processor, which supports only MMX, SSE, SSE2 and SSE3. My previous application had support for SSSE3 as well as AVX; now I want to downgrade it to Intel Atom…
Harrisson
  • 255
  • 2
  • 21
5
votes
2 answers

How to enable SSSE3 intrinsics but disable their use in compiler optimization

I have code that uses SSSE3 intrinsics (note the triple S) and a runtime check for whether to use them, so I assumed that the application would run on CPUs without SSSE3 support. However, when using -mssse3 with -O1 optimization the…
Mark S
  • 235
  • 3
  • 11
4
votes
1 answer

Converting 24-bit to 16-bit audio using SSE/SIMD instructions

I wonder if there is any fast method to do a 24-bit to 16-bit quantization on an array of audio samples (using intrinsics or asm). The source format is signed 24-bit little-endian. Update: I managed to get the conversion done as described: static void __cdecl…
ohrfritz
  • 41
  • 2
4
votes
1 answer

C++ SSE3 instruction set not enabled

I'm trying to work up some hidden Markov code in C++ using the HMMlib library from http://www.cs.au.dk/~asand/?page_id=152. I am using Ubuntu 12.04 with gcc/g++ 4.6. My compile command is: g++ -I/usr/local/boost_1_52_0 -I../…
Aditya Sihag
  • 5,057
  • 4
  • 32
  • 43
3
votes
1 answer

Conversion of four packed single-precision floating-point values to unsigned doublewords in x86 SSE

Is there a way to convert four packed single-precision floating-point values to four doublewords in x86 with the SSE extension? The closest instruction would be CVTPS2PI, but it cannot be executed on two xmm registers, instead should be given as…
3
votes
1 answer

Intel intrinsics support for Atom Cloverview processor

I have an application which was designed for Sandy Bridge processors using SSE through AVX; now I want the same application to run on Atom processors. I was recently browsing the net for intrinsics support for Atom Cloverview processors. Everywhere it mentions…
Harrisson
  • 255
  • 2
  • 21
3
votes
2 answers

Memory Access Violations When Using SSE Operations

I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new…
Eric Foote
  • 76
  • 5
2
votes
2 answers

How do I enable the SSE3/SSE4.1 instruction set in Visual Studio 2008?

I tried to follow: Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set, but the only options I got were SSE or SSE2. Thanks.
Igor
  • 21
  • 1
  • 2
2
votes
1 answer

SIMD integer store

I am writing a program using SSE instructions to multiply and add integer values. I did the same program with floats but I am missing an instruction for my integer version. With floats, after I have finished all my operations, I return the values…
Thudor
  • 349
  • 2
  • 7