
I am trying to rewrite C++ source code that uses SSE intrinsics as plain C++ without them. I know I will lose performance, but this is an experiment I want to run. Is there a C++ equivalent of `_mm_unpackhi_pd` and `_mm_unpacklo_pd`? I have zero knowledge of SSE.

Here is a snippet of the code I am trying to convert, for reference. Any knowledge or tips would be helpful. Thank you.

for (unsigned chunk = 0; chunk < chunks; chunk++)
{
  unsigned start = chunk * chunksize;
  unsigned end =
    std::min((chunk + 1) * chunksize, (unsigned)2 * w);
  __m128d a2b2 =
    _mm_load_pd(d_origx +
                ((2 * init_G_offset + start) & n2_m_1));
  unsigned i2_mod_B = 0;
  for (unsigned i = start; i < end; i += 2)
    {
      __m128d ab = a2b2;
      a2b2 =
        _mm_load_pd(d_origx +
                    ((origx_offset + i) & n2_m_1));
      __m128d cd = _mm_load_pd(d_filter + i);

      __m128d cc = _mm_unpacklo_pd(cd, cd);
      __m128d dd = _mm_unpackhi_pd(cd, cd);

      __m128d a0a1 = _mm_unpacklo_pd(ab, a2b2);
      __m128d b0b1 = _mm_unpackhi_pd(ab, a2b2);

      __m128d ac = _mm_mul_pd(cc, a0a1);
      __m128d ad = _mm_mul_pd(dd, a0a1);
      __m128d bc = _mm_mul_pd(cc, b0b1);
      __m128d bd = _mm_mul_pd(dd, b0b1);

      __m128d ac_m_bd = _mm_sub_pd(ac, bd);
      __m128d ad_p_bc = _mm_add_pd(ad, bc);

      __m128d ab_times_cd = _mm_unpacklo_pd(ac_m_bd, ad_p_bc);
      __m128d a2b2_times_cd =
        _mm_unpackhi_pd(ac_m_bd, ad_p_bc);

      __m128d xy = _mm_load_pd(d_x_sampt + i2_mod_B);
      __m128d x2y2 = _mm_load_pd(d_x_sampt + i2_mod_B + 2);

      __m128d st = _mm_add_pd(xy, ab_times_cd);
      __m128d s2t2 = _mm_add_pd(x2y2, a2b2_times_cd);

      _mm_store_pd(d_x_sampt + i2_mod_B, st);
      _mm_store_pd(d_x_sampt + i2_mod_B + 2, s2t2);

      i2_mod_B += 4;
    }
}
  • Look here: http://stackoverflow.com/questions/7156908/sse-intrinsic-functions-reference – mike Apr 24 '17 at 22:13
  • I'm sure it's a pedantic quibble over terminology, but… **this *is* C++ code**! Sure, it uses SSE intrinsics, but it's still written in C++, and it would require a C++ compiler to compile it. Most importantly, it has all the advantages of using SSE instructions for optimal performance, with none of the typical disadvantages. The only "limitation" is that it requires your processor have SSE support, which is not a high bar, just limits portability. – Cody Gray - on strike Apr 27 '17 at 08:53
  • But of course it's C++. My mistake. I just wanted to remove the SSE instructions. – Arnov Sinha Apr 27 '17 at 19:29

2 Answers


Below are descriptions of the two functions; I've also linked each function to its reference page. The whole reference is available here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

_mm_unpackhi_pd

__m128d _mm_unpackhi_pd (__m128d a, __m128d b)

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of a and b, and store the results in dst.


_mm_unpacklo_pd

__m128d _mm_unpacklo_pd (__m128d a, __m128d b)

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of a and b, and store the results in dst.

mike
  • Hi Mike, thank you for providing the reference link. Actually, I wanted to know if there is a C/C++ alternative that does the same without using SSE instructions. I would like the compiler itself to auto-vectorize these things instead of using SSE instructions. – Arnov Sinha Apr 24 '17 at 22:41
  • Now that you know what these functions do, you can implement them yourself in C++. The reference contains detailed descriptions. – mike Apr 24 '17 at 23:16
  • Oh great, thanks Mike, I really appreciate it. One last question: can you tell me how to write src1[127:0] in C++? Thank you. – Arnov Sinha Apr 24 '17 at 23:28
  • I assume it stands for bit 127 to bit 0 from `src1`. I don't know how `__m128d` is represented and I also don't know how to interpret the endianness. You should create a new question. – mike Apr 26 '17 at 10:27
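On the `src1[127:0]` question: in the Intel pseudocode that notation is a bit range, so `[63:0]` is the low double and `[127:64]` is the high double. A minimal sketch of one way to model that in plain C++ follows. The names `Vec2d`, `load_pd` and `store_pd` are hypothetical stand-ins chosen for illustration, not part of any library.

```cpp
#include <array>

// Hypothetical stand-in for __m128d: index 0 holds bits [63:0] (the low
// half) and index 1 holds bits [127:64] (the high half).
using Vec2d = std::array<double, 2>;

// Scalar counterpart of _mm_load_pd: p[0] lands in the low half and p[1]
// in the high half. Unlike the intrinsic, no 16-byte alignment is needed.
inline Vec2d load_pd(const double* p) {
    return {p[0], p[1]};
}

// Scalar counterpart of _mm_store_pd: low half to p[0], high half to p[1].
inline void store_pd(double* p, const Vec2d& v) {
    p[0] = v[0];
    p[1] = v[1];
}
```

With this layout, "the low half of a" in the descriptions above is simply `a[0]` and "the high half" is `a[1]`.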

Exactly how to implement it depends on your representation, but basically you return a new value composed of the high (or low) half of a concatenated with the high (or low) half of b. For example:

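// Stand-in for the SSE type: v[0] is the low half, v[1] is the high half.
// A struct is used because C++ functions cannot take or return raw arrays
// by value. Names starting with a double underscore are reserved for the
// implementation, so rename the type in real code.
struct __m128d {
  double v[2];
};

// Result: low element is the high half of a, high element is the high half of b.
__m128d _mm_unpackhi_pd(__m128d a, __m128d b) {
  __m128d res;
  res.v[0] = a.v[1];
  res.v[1] = b.v[1];
  return res;
}

// Result: low element is the low half of a, high element is the low half of b.
__m128d _mm_unpacklo_pd(__m128d a, __m128d b) {
  __m128d res;
  res.v[0] = a.v[0];
  res.v[1] = b.v[0];
  return res;
}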

Weird timing on this question… I found this issue while implementing this function for SIMDe, and it's only 17 days old. If you want to use SIMDe as a reference, these functions are in sse2.h along with a lot of others. The code in SIMDe is a bit more complex than what's above, but that's mostly just to match the implementations of the other _mm_unpack* functions.
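For the loop in the question specifically, the unpack calls may not need emulating at all. Tracing the intrinsics, each iteration appears to multiply two complex samples (`ab` and `a2b2`, stored as real/imaginary pairs) by one complex filter value `cd` and add the products into `d_x_sampt`. A minimal scalar sketch of one iteration, assuming that reading of the code is right; `complex_mul_add` and `accumulate_pair` are hypothetical names introduced here:

```cpp
// Adds (a + b*i) * (c + d*i) to the complex value stored at out[0], out[1].
inline void complex_mul_add(double a, double b, double c, double d,
                            double* out) {
    out[0] += a * c - b * d;  // real part, ac_m_bd in the question
    out[1] += a * d + b * c;  // imaginary part, ad_p_bc in the question
}

// One iteration of the inner loop: both samples are multiplied by the same
// filter value cd and accumulated into four consecutive doubles.
inline void accumulate_pair(const double* ab, const double* a2b2,
                            const double* cd, double* x_sampt) {
    complex_mul_add(ab[0], ab[1], cd[0], cd[1], x_sampt);
    complex_mul_add(a2b2[0], a2b2[1], cd[0], cd[1], x_sampt + 2);
}
```

Written this way, the data layout is explicit and the compiler is free to auto-vectorize the arithmetic if it chooses to.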

nemequ