I'd like to convert this scalar code:

int64_t res = floatValue * int64Value;

using SSE/SIMD (built with -march=nocona), and later convert the value back to float:

float finalRes = res;

Is it possible? I would do something like this:

__m128 res = _mm_mul_ps(floatValue4, int64Value4);
__m128i res1 = _mm_cvttps_epi64(res);
__m128i res2 = _mm_cvttps_epi64(_mm_movehl_epi64(res, res));

but I can't find either _mm_cvttps_epi64 or _mm_movehl_epi64 for the target platform.
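
For reference, here is a minimal scalar sketch of the behavior a vector version would need to match. The function names are hypothetical, and it assumes the usual C arithmetic conversions with no extended intermediate precision: the `int64_t` operand is converted to `float`, the product is computed in `float`, and the conversion back to `int64_t` truncates toward zero.

```c
#include <stdint.h>

/* Hypothetical helper: scalar reference for `int64_t res = floatValue * int64Value;`.
   The integer operand is converted to float first, and the float product
   is truncated toward zero when converted to int64_t. */
static int64_t scale_to_int64(float floatValue, int64_t int64Value) {
    float product = floatValue * (float)int64Value;
    return (int64_t)product;
}

/* Hypothetical helper: scalar reference for `float finalRes = res;`. */
static float back_to_float(int64_t res) {
    return (float)res;
}
```

Truncation toward zero is what the `cvtt` family of conversions does as well, which is presumably why `_mm_cvttps_epi64` looked like the natural match.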

markzzz
  • Not sure what your expectations are. Did you try to see what the compiler generates: https://godbolt.org/z/od7WMoYTG ? From my point of view it looks fine. – Marek R Apr 16 '21 at 10:36
  • @Marek R what do you mean by "fine"? You show me "scalar" code. I need vector one :) – markzzz Apr 16 '21 at 10:46
  • https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx – nemequ Apr 16 '21 at 11:38
  • @nemequ int64_to_double_full needs _mm_blend_epi16, which is SSE4.1 (and so doesn't match my target arch) – markzzz Apr 16 '21 at 13:22
  • `_mm_blend_epi16` is easy to emulate on SSE2; instead of `_mm_blend_epi16(x, 0x88)`, try something like `__m128i m = _mm_set_epi16(0, ~0, ~0, ~0, 0, ~0, ~0, ~0); __m128i xL = _mm_or_si128(_mm_and_si128(m, x), _mm_andnot_si128(m, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))));`. For 0x33 it's even easier since one of the vectors is all zeros: `_mm_and_si128(xH, _mm_set_epi16(~0, ~0, 0, 0, ~0, ~0, 0, 0))`. – nemequ Apr 16 '21 at 17:30
  • Are you sure you need 64-bit integers? In your last question ([How to convert/merge two double (m128d) into one single (m128)?](https://stackoverflow.com/q/67111388)) your integer constant converted to a compile-time-constant float, which is a lot more efficient than anything you can do with int64. Until AVX-512, there's no single-instruction FP<->int64 SIMD conversion (only scalar), and no int64 SIMD multiply. – Peter Cordes Apr 16 '21 at 22:26
  • What you're looking for with `_mm_movehl_epi64(v,v)` is just `_mm_unpackhi_epi64(v,v)`, or if you only want the low half of the result then the `_mm_srli_si128(v, 8)` you were already using also works. – Peter Cordes Apr 16 '21 at 22:29
  • @PeterCordes the original code uses int64_t. That's because the integrators need that headroom (depending on the settings you choose). In the loop, both integrators and combs are summed/subtracted, right? – markzzz Apr 17 '21 at 06:42
  • I don't really grok the overall algorithm of this code; not one I'm familiar with and you didn't describe it. Can't you use `double` instead, though? Scaling by 2^32 is fine for `double`. (Or even `float`; it has enough exponent range, but IDK if it has enough mantissa precision for you). OTOH, for that serial dependency where you update `s -= pCombs[i]`, integer is nice because it's lower latency for the loop-carried dep chain. Same for the prefix-sum with `pIntegrators`. You'd have to unroll more to hide `double` add latency. int64_t add/sub is fine with SIMD, but conversion sucks. – Peter Cordes Apr 17 '21 at 06:57
  • @PeterCordes it's a FIR CIC filter (decimator); more about it: https://www.dsprelated.com/showarticle/1337.php – markzzz Apr 17 '21 at 07:14
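
The int64-to-double direction discussed in the comments can be sketched with SSE2 only, as follows. This is an adaptation under stated assumptions, not a tested drop-in: it follows the magic-constant approach of `int64_to_double_full` from the linked question, with each `_mm_blend_epi16` replaced by the AND/ANDNOT/OR pattern suggested above. The names `select_epi16_sse2` and `int64_to_double_sse2` are hypothetical, and the opposite direction (double to int64) is not covered.

```c
#include <emmintrin.h>  /* SSE2 only, available with -march=nocona */
#include <stdint.h>

/* Per 16-bit lane: take a where the mask lane is all-ones, else take b.
   Stands in for the SSE4.1 blend. */
static __m128i select_epi16_sse2(__m128i mask, __m128i a, __m128i b) {
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* Convert two signed 64-bit integers to two doubles. */
static __m128d int64_to_double_sse2(__m128i x) {
    /* All-ones in 16-bit lanes 0-2 and 4-6 (low 48 bits of each element). */
    const __m128i keep_low48 = _mm_set_epi16(0, -1, -1, -1, 0, -1, -1, -1);
    /* All-ones in the upper 32 bits of each 64-bit element. */
    const __m128i keep_high32 = _mm_set_epi16(-1, -1, 0, 0, -1, -1, 0, 0);
    const __m128d magic_hi  = _mm_set1_pd(442721857769029238784.0); /* 3*2^67 */
    const __m128d magic_lo  = _mm_set1_pd(4503599627370496.0);      /* 2^52 */
    const __m128d magic_all = _mm_set1_pd(442726361368656609280.0); /* 3*2^67 + 2^52 */

    /* Signed top 16 bits of each element, placed in its upper 32 bits. */
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_and_si128(xH, keep_high32);
    /* Integer add onto the bit pattern of 3*2^67 yields a double that
       encodes 3*2^67 + (top 16 bits) * 2^48. */
    xH = _mm_add_epi64(xH, _mm_castpd_si128(magic_hi));

    /* Low 48 bits spliced under the exponent of 2^52: 2^52 + low48. */
    __m128i xL = select_epi16_sse2(keep_low48, x, _mm_castpd_si128(magic_lo));

    /* Remove both offsets; only the final add rounds. */
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), magic_all);
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
```

The two resulting doubles could then be narrowed with `_mm_cvtpd_ps` if a `float` result is needed, at the cost of the extra conversion steps the comments warn about.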
