
I have a large number of modulo calculations to perform. The basic calculation is as follows:

const uint64_t start;       // Some "large" number that does NOT change
uint32_t prime[bigNumber];  // Precalculated sequential prime numbers (generated on the fly from a bit compaction storage method for space reasons).
uint64_t answer[bigNumber]; // The "modulo" answers

for (uint64_t i = 0; i < bigNumber; i++) {
   uint32_t factor = prime[i];
   answer[i] = (factor - 1) - ((start - 1) % factor);
}

Note: start is generally much larger than prime[i].

Is there a faster way to calculate the "answer(s)" without performing a modulo / division for each iteration (i.e., can knowing answer[i - 1] help you get answer[i] faster)? Any other improvements or suggestions would be greatly appreciated.

ChipK
  • Maybe you can apply fast modulo from [this](https://stackoverflow.com/a/33333636/6523658) source. – dzimiks May 21 '18 at 18:50
  • Unless `prime` sequence and/or `start` have some peculiar properties, I suspect there is no faster way. Generally given large enough `start` you can make the `answer[i]` any value for some `i` leaving all other `answer`s the same (see [Chinese remainder theorem](https://en.wikipedia.org/wiki/Chinese_remainder_theorem)). – SergGr May 21 '18 at 18:50
  • Does the primes being precalculated mean you could perform arbitrary preprocessing on them? – harold May 21 '18 at 19:06
  • Fast modulo won't apply since start is generally much larger than prime[i]. Arbitrary preprocessing is also probably not useful due to the storage restrictions of generating prime[i] on the fly from a bit array. – ChipK May 21 '18 at 20:28
  • The only feasible way to speed up this routine is to run parts of it in parallel. – n. m. could be an AI May 21 '18 at 20:43
  • It's really edge-case-ridden but emulating modulo with floating point arithmetic is faster (in scalar code already) and also vectorizable – harold May 21 '18 at 21:20
  • @harold, could you provide any references about the methods to do such emulation? – SergGr May 21 '18 at 22:08
  • @SergGr not specifically (well there's [this](http://www.texmacs.org/joris/simd/simd.html) but it doesn't really address this exact thing), it's just estimating the quotient inexactly but "close enough" and then patching it up as needed – harold May 21 '18 at 22:29
  • My implementation already uses parallelism with multiple threads (not shown). As for data parallelism or vectorization, the problem I'm running into is "start" can exceed 2^53 - 1 (max int without losing precision in conversions). Also divide only exists for x86 vectorization intrinsics in non-integer operations. Besides the conversions from integer to double and vice versa slowing it down, vectorization is also limited to only 2 doubles at a time. Yes, the vector double divide is faster than a single integer division. All great ideas but have some limitation to me. – ChipK May 22 '18 at 15:52
  • I was hoping to use some relation like this (of course without the additional divides and mods): c mod (a+b) = (c mod a) + [bc \ (a+b)] mod b - [bc \ (a + b)] mod a. – ChipK May 22 '18 at 15:56
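To make the floating-point emulation harold describes in the comments concrete, here is a hypothetical scalar sketch (my construction, not from the thread): estimate the quotient in double precision, then patch the remainder back into range, since the rounded quotient can be off by one. It assumes the dividend is below 2^53 so its conversion to double is exact.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical sketch of "estimate the quotient inexactly, then patch
// it up": compute n % d via double-precision division. Only valid
// while n < 2^53 so (double)n is exact; d is a 32-bit value, so
// (double)d is always exact.
static uint32_t mod_via_double(uint64_t n, uint32_t d) {
    double q = floor((double)n / (double)d);      // approximate quotient
    int64_t r = (int64_t)n - (int64_t)((uint64_t)q * d);  // candidate remainder
    // The division rounds, so floor(q) can be off by at most one in
    // either direction; bring the remainder back into [0, d).
    if (r < 0)
        r += d;
    else if (r >= (int64_t)d)
        r -= d;
    return (uint32_t)r;
}
```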

1 Answer


I wanted to put down a partial answer in response to some of the comments above. It helps by computing four remainders at once using AVX2 double-precision division and FMA.

  // Requires <immintrin.h> and AVX2 + FMA; handles 4 primes per pass.
  // Only valid while start - 1 is exactly representable as a double.
  if (start < (1ULL << DBL_MANT_DIG)) {
    __m256d div1 = _mm256_broadcastsd_pd(_mm_cvtsi64_sd(_mm_setzero_pd(), start - 1));
    __m128i mone = _mm_set1_epi32(-1);                            // constant -1
    __m128i fact = _mm_loadu_si128((const __m128i *)&prime[i]);   // 4 primes

    __m256d div2 = _mm256_cvtepi32_pd(fact);

    // rem = (start - 1) - floor((start - 1) / factor) * factor
    __m128i rem = _mm256_cvtpd_epi32(_mm256_fnmadd_pd(
                  _mm256_floor_pd(_mm256_div_pd(div1, div2)), div2, div1));

    // answer = factor - (rem - (-1)) = (factor - 1) - rem
    _mm256_storeu_si256((__m256i *)&answer[i],
                        _mm256_cvtepu32_epi64(_mm_sub_epi32(fact,
                                              _mm_sub_epi32(rem, mone))));
  }
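For context, here is one way the kernel might be driven over the whole array; the function name, the guard macros, and the scalar tail are my assumptions, not part of the original answer, and the fallback also covers builds without AVX2/FMA or a start at or above 2^53.

```c
#include <float.h>
#include <stdint.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical driver: vectorize 4 primes at a time when the FP trick
// is safe (start - 1 exact as a double), otherwise / for the tail use
// the plain scalar expression from the question.
static void compute_answers(uint64_t start, const uint32_t *prime,
                            uint64_t *answer, uint64_t bigNumber) {
    uint64_t i = 0;
#if defined(__AVX2__) && defined(__FMA__)
    if (start < (1ULL << DBL_MANT_DIG)) {
        __m256d div1 = _mm256_set1_pd((double)(start - 1));
        for (; i + 4 <= bigNumber; i += 4) {
            __m128i fact = _mm_loadu_si128((const __m128i *)&prime[i]);
            __m256d div2 = _mm256_cvtepi32_pd(fact);
            // rem = (start - 1) - floor((start - 1) / factor) * factor
            __m128i rem  = _mm256_cvtpd_epi32(_mm256_fnmadd_pd(
                           _mm256_floor_pd(_mm256_div_pd(div1, div2)), div2, div1));
            // answer = (factor - 1) - rem
            __m256i res  = _mm256_cvtepu32_epi64(
                           _mm_sub_epi32(_mm_sub_epi32(fact, rem),
                                         _mm_set1_epi32(1)));
            _mm256_storeu_si256((__m256i *)&answer[i], res);
        }
    }
#endif
    for (; i < bigNumber; i++) {   // scalar tail / fallback
        uint32_t factor = prime[i];
        answer[i] = (factor - 1) - ((start - 1) % factor);
    }
}
```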

Please comment if you have an improvement to this partial answer.

ChipK
  • Replaced _mm256_mul_pd and _mm256_sub_pd with _mm256_fnmadd_pd – ChipK May 29 '18 at 15:38
  • Some compilers can fuse `_mm256_mul_pd` with `_mm256_sub_pd` into an FMA for you, if you compile with `-mfma` as well as `-mavx`. (Or much better to set tuning options as well, with `-march=haswell` or higher.) But if you don't care about AVX without FMA, or the compilers you care about don't do a good job, then sure use FMA intrinsics. – Peter Cordes May 30 '18 at 02:47