Performance degrade while using alternative for Intel intrinsics SSSE3

Question

I am developing a performance-critical application that has to be ported to an Intel Atom processor, which supports only MMX, SSE, SSE2 and SSE3. The previous version of the application used SSSE3 as well as AVX; now I want to downgrade it to the instruction sets available on the Atom (MMX, SSE, SSE2, SSE3).

There is a serious performance degradation when I replace the SSSE3 instructions, in particular _mm_hadd_epi16, with this code:

RegTemp1 = _mm_setr_epi16(RegtempRes1.m128i_i16[0], RegtempRes1.m128i_i16[2], 
                          RegtempRes1.m128i_i16[4], RegtempRes1.m128i_i16[6],
                          Regfilter.m128i_i16[0],   Regfilter.m128i_i16[2],
                          Regfilter.m128i_i16[4],   Regfilter.m128i_i16[6]);

RegTemp2 = _mm_setr_epi16(RegtempRes1.m128i_i16[1], RegtempRes1.m128i_i16[3],
                          RegtempRes1.m128i_i16[5], RegtempRes1.m128i_i16[7],
                          Regfilter.m128i_i16[1],   Regfilter.m128i_i16[3],
                          Regfilter.m128i_i16[5], Regfilter.m128i_i16[7]);

RegtempRes1 = _mm_add_epi16(RegTemp1, RegTemp2);

This is the best conversion I was able to come up with for this particular instruction, but the change has seriously affected the performance of the entire program.

Can anyone suggest a more efficient alternative to the _mm_hadd_epi16 instruction that uses only MMX, SSE, SSE2 and SSE3 instructions? Thanks in advance.

The Intel Atom processor I am using does not support SSSE3 or higher instruction sets. So, I want my application to support just SSE, SSE2 and SSE3 instruction sets. — Harrisson, Feb 21 '14 at 07:51
@Harrisson, is your goal to do the horizontal sum of 8 16-bit values? — Z boson, Feb 21 '14 at 08:40
@Zboson, yes, I want to add 8 adjacent 16-bit values in two registers and store the result in a 16-bit destination. — Harrisson, Feb 21 '14 at 10:58
@Marat, I am also a bit confused about the intrinsic support on Atom processors. Here are links with two contradicting statements. The wiki says that these processors support up to SSSE3 (I have also shared a screenshot, please check). I have also added a few links from Intel's official website, which mention that these processors support only up to SSE3. I would be very glad if you could share any official links backing your claim that all Atom processors support SSSE3. — Harrisson, Feb 21 '14 at 11:14
http://en.wikipedia.org/wiki/List_of_Intel_Atom_microprocessors http://ark.intel.com/products/70100/intel-atom-processor-z2580-1mb-cache-2_00-ghz http://ark.intel.com/products/70101/ — Harrisson, Feb 21 '14 at 11:23
@Harrisson, why don't you check your CPU? Use something like /proc/cpuinfo on linux or cpu-z on windows or do it from code with CPUID — Z boson, Feb 21 '14 at 11:49
@Zboson, I am developing an app that supports all Atom processors, which includes Bay Trail, Cloverview, Penwell and some other processors as well. Yes, CPUID is an option, but this is not related to desktop, as most Atom processors are used in smartphones and tablets. Can you help me clear up the ambiguity about the intrinsic support on the Clover Trail Atom processor (the wiki says it supports SSSE3 whereas Intel's website says otherwise)? Where do we stand on this? — Harrisson, Feb 21 '14 at 12:14
@Harrisson, maybe you're just having troubles enabling SSSE3 with Atom in your compiler? Have you searched for this? Here is a discussion where someone had a problem getting SSSE3 working with GCC with Atom http://forum.serviio.org/viewtopic.php?f=14&t=6931 — Z boson, Feb 21 '14 at 14:42
@Harrisson, if you want an official confirmation, ask on Intel Software Forums. But I'm sure all Atoms support SSSE3: gcc will enable SSSE3 if you specify `-march=atom` and the option to enable code-generation for Atom in Intel compiler is named `-xATOM_SSSE3`. Bay Trail is based on newer Silvermont microarchitecture and additionally supports `SSE4.2`. — Marat Dukhan, Feb 21 '14 at 20:55
@Marat, I have posted the question on Intel Developer's Forum but still the question remains unanswered. — Harrisson, Feb 22 '14 at 09:16
@Zboson, I don't have the required hardware for Intel Atom processors but as a developer my job is to support all atom processors which includes cloverview as well. — Harrisson, Feb 22 '14 at 09:18
How important is it that the results be in the same locations that _mm_hadd_epi16 would produce, so long as all the same data is present? Can the surrounding code be restructured? If so, you could use a multiplication by 0x0101 to have a horizontal add in every other short. — Apriori, Feb 23 '14 at 23:07
@Apriori, thanks for your help with the instruction. I will try it out on my app. — Harrisson, Feb 24 '14 at 14:41
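
Editor's note on the CPUID suggestion in the comments above: a minimal sketch of a runtime check, assuming GCC or Clang on x86 (the `<cpuid.h>` header and its `__get_cpuid` helper are compiler-specific; MSVC offers `__cpuid` in `<intrin.h>` instead). The struct and function names are hypothetical and only illustrate how an app could choose between the SSSE3 path and an SSE2 fallback at startup.

```c
#include <assert.h>
#include <cpuid.h>   /* GCC/Clang only; an assumption about the toolchain */

/* Hypothetical container for the feature bits discussed in this thread. */
struct simd_features {
    int sse2;   /* CPUID leaf 1, EDX bit 26 */
    int sse3;   /* CPUID leaf 1, ECX bit 0  */
    int ssse3;  /* CPUID leaf 1, ECX bit 9  */
};

/* Query CPUID leaf 1 and extract the three feature bits. */
static struct simd_features query_simd_features(void)
{
    struct simd_features f = {0, 0, 0};
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2  = (int)((edx >> 26) & 1u);
        f.sse3  = (int)(ecx & 1u);
        f.ssse3 = (int)((ecx >> 9) & 1u);
    }
    return f;
}

/* Hypothetical dispatcher: returns a label for the code path to use. */
static const char *choose_hadd_path(void)
{
    struct simd_features f = query_simd_features();

    /* The fallbacks discussed in this thread need at least SSE2. */
    assert(f.sse2);
    return f.ssse3 ? "ssse3" : "sse2-fallback";
}
```

Calling `choose_hadd_path()` once at startup and caching the result (for example in a function pointer) keeps the check out of the hot loop.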

Marat Dukhan · Answer 1 · 2014-02-23T03:58:54.687

8

_mm_hadd_epi16(a, b) can be simulated with the following code:

/* (b3, a3, b2, a2, b1, a1, b0, a0) */
__m128i ab0 = _mm_unpacklo_epi16(a, b);
/* (b7, a7, b6, a6, b5, a5, b4, a4) */
__m128i ba0 = _mm_unpackhi_epi16(a, b);

/* (b5, b1, a5, a1, b4, b0, a4, a0) */
__m128i ab1 = _mm_unpacklo_epi16(ab0, ba0);
/* (b7, b3, a7, a3, b6, b2, a6, a2) */
__m128i ba1 = _mm_unpackhi_epi16(ab0, ba0);

/* (b6, b4, b2, b0, a6, a4, a2, a0) */
__m128i ab2 = _mm_unpacklo_epi16(ab1, ba1);
/* (b7, b5, b3, b1, a7, a5, a3, a1) */
__m128i ba2 = _mm_unpackhi_epi16(ab1, ba1);


/* (b6+b7, b4+b5, b2+b3, b0+b1, a6+a7, a4+a5, a2+a3, a0+a1) */
__m128i c = _mm_add_epi16(ab2, ba2);
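
Editor's note: the sequence above can be wrapped in a function and compared against a scalar model of what _mm_hadd_epi16(a, b) returns. This is a sketch only; the names `hadd_epi16_sse2`, `hadd_epi16_scalar`, `hadd_matches_scalar` and `hadd_self_check` are hypothetical, and the scalar model assumes the sums do not overflow 16 bits.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics only; no SSSE3 header needed */
#include <stdint.h>
#include <string.h>

/* Hypothetical name: SSE2-only stand-in for _mm_hadd_epi16(a, b). */
static __m128i hadd_epi16_sse2(__m128i a, __m128i b)
{
    __m128i ab0 = _mm_unpacklo_epi16(a, b);
    __m128i ba0 = _mm_unpackhi_epi16(a, b);
    __m128i ab1 = _mm_unpacklo_epi16(ab0, ba0);
    __m128i ba1 = _mm_unpackhi_epi16(ab0, ba0);
    __m128i ab2 = _mm_unpacklo_epi16(ab1, ba1);  /* even-indexed elements */
    __m128i ba2 = _mm_unpackhi_epi16(ab1, ba1);  /* odd-indexed elements  */
    return _mm_add_epi16(ab2, ba2);
}

/* Scalar model of the lane order: out[0..3] are the pair sums of a,
   out[4..7] are the pair sums of b. */
static void hadd_epi16_scalar(const int16_t a[8], const int16_t b[8],
                              int16_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[i]     = (int16_t)(a[2 * i] + a[2 * i + 1]);
        out[i + 4] = (int16_t)(b[2 * i] + b[2 * i + 1]);
    }
}

/* Returns 1 when the SSE2 version agrees with the scalar model. */
static int hadd_matches_scalar(const int16_t a[8], const int16_t b[8])
{
    int16_t got[8], want[8];
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    _mm_storeu_si128((__m128i *)got, hadd_epi16_sse2(va, vb));
    hadd_epi16_scalar(a, b, want);
    return memcmp(got, want, sizeof got) == 0;
}

/* Small fixed-input check; call it once from main() or a test runner. */
static void hadd_self_check(void)
{
    const int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    assert(hadd_matches_scalar(a, b));
}
```

The function costs six unpacks and one add in place of a single instruction, so it is a functional replacement rather than a guaranteed speed match; measure it in the real loop.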

edited Feb 23 '14 at 03:58

answered Feb 21 '14 at 08:13

Thank you for your answer. Here are the performance numbers I have just measured in my app: 1) with _mm_hadd_epi16 it took 255 ms; 2) with my previous code it took 356 ms; 3) with your replacement for the instruction it took at most 262 ms. These results confirm that your approach is far better than mine. I was wondering whether there is a way to beat the performance of the SSSE3 instruction using only SSE3 and lower instruction sets. – Harrisson Feb 21 '14 at 12:32

score 3 · Accepted Answer · answered Feb 21 '14 at 11:03

If your goal is to take the horizontal sum of 8 16-bit values you can do this with SSE2 like this:

__m128i sum1  = _mm_shuffle_epi32(a,0x0E);             // 4 high elements
__m128i sum2  = _mm_add_epi16(a,sum1);                 // 4 sums
__m128i sum3  = _mm_shuffle_epi32(sum2,0x01);          // 2 high elements
__m128i sum4  = _mm_add_epi16(sum2,sum3);              // 2 sums
__m128i sum5  = _mm_shufflelo_epi16(sum4,0x01);        // 1 high element
__m128i sum6  = _mm_add_epi16(sum4,sum5);              // 1 sum
int16_t sum7  = (int16_t)_mm_cvtsi128_si32(sum6);      // low 16 bits hold the total
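
Editor's note: as a usage sketch, the same steps can be placed in a helper and checked on a known input. The names `hsum_epi16_sse2` and `hsum_self_check` are hypothetical; the total wraps modulo 2^16, as with any 16-bit add.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: sum of the eight 16-bit lanes of a (wraps mod 2^16). */
static int16_t hsum_epi16_sse2(__m128i a)
{
    __m128i hi64 = _mm_shuffle_epi32(a, 0x0E);       /* upper 4 lanes moved down */
    __m128i s4   = _mm_add_epi16(a, hi64);           /* 4 partial sums in lanes 0..3 */
    __m128i hi32 = _mm_shuffle_epi32(s4, 0x01);      /* lanes 2..3 moved down */
    __m128i s2   = _mm_add_epi16(s4, hi32);          /* 2 partial sums in lanes 0..1 */
    __m128i hi16 = _mm_shufflelo_epi16(s2, 0x01);    /* lane 1 moved down */
    __m128i s1   = _mm_add_epi16(s2, hi16);          /* total in lane 0 */
    return (int16_t)_mm_cvtsi128_si32(s1);           /* keep the low 16 bits */
}

/* Small fixed-input check; call it once from main() or a test runner. */
static void hsum_self_check(void)
{
    const int16_t v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128i a = _mm_loadu_si128((const __m128i *)v);
    assert(hsum_epi16_sse2(a) == 36);
}
```

Note that this reduces one register to a single scalar, which is a different operation from _mm_hadd_epi16(a, b); it only helps if the surrounding code needs the full total rather than the pairwise sums.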

Thank you for the input. I am still testing your changes in my app; once that is complete I will let you know the results. — Harrisson, Feb 21 '14 at 12:33
