
For educational purposes I'm developing a C++ library for operating on large numbers represented as vectors of chars (`vector<char>`).

Here is the algorithm I am using for multiplication:

string multiplicationInner(CharVector a, CharVector b) {
  // Work least-significant digit first.
  reverse(a.begin(), a.end());
  reverse(b.begin(), b.end());

  IntVector stack(a.size() + b.size() + 1);

  // Schoolbook multiplication: accumulate all digit products per position.
  for (int i = 0; i < a.size(); i++)
    for (int j = 0; j < b.size(); j++)
      stack[i + j] += charToInt(a[i]) * charToInt(b[j]);

  // Propagate carries so every position holds a single decimal digit.
  for (int i = 0; i < stack.size(); i++) {
    int num = stack[i] % 10;
    int move = stack[i] / 10;
    stack[i] = num;

    if (stack[i + 1])
      stack[i + 1] += move;
    else if (move)
      stack[i + 1] = move;
  }

  CharVector stackChar = intVectorToCharVector(&stack);
  deleteZerosAtEnd(&stackChar);
  reverse(stackChar.begin(), stackChar.end());

  return charVectorToString(&stackChar);
}
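For context, the function takes digits most-significant first and returns the product as a string. A minimal usage sketch, assuming `CharVector` is `std::vector<char>` of ASCII digits and `charToInt` maps `'0'`..`'9'` to 0..9:

// Hypothetical usage; 128 * 99 = 12672.
CharVector a = {'1', '2', '8'};
CharVector b = {'9', '9'};
std::string product = multiplicationInner(a, b); // expected "12672"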

This function is called billions of times in my program, so I would like to add `#pragma omp parallel for` to it.

My question is: how can I parallelize the first loop?

This is what I have tried:

int i, j;
#pragma omp parallel for
for (i = 0; i < a.size(); i++) {
  for (j = 0; j < b.size(); j++)
    stack[i + j] += charToInt(a[i]) * charToInt(b[j]);
}

The algorithm stops working properly. Advice needed.

Edit: This variant works, but a benchmark shows that with `omp parallel for` it is 15x-20x slower than without it. (CPU: M1 Pro, 8 cores)

#pragma omp parallel for schedule(dynamic)
for (int k = 0; k < a.size() + b.size(); k++) {
  for (int i = 0; i < a.size(); i++) {
    int j = k - i;
    if (j >= 0 && j < b.size()) {
      stack[k] += charToInt(a[i]) * charToInt(b[j]);
    }
  }
}

This is the part of my program where multiplication is called most often (the Miller-Rabin test):

BigInt modularExponentiation(BigInt base, BigInt exponent, BigInt mod) {
  BigInt x = B_ONE; // 1
  BigInt y = base;

  while (exponent > B_ZERO) { // while exponent > 0
    if (isOdd(exponent))
      x = (x * y) % mod;
    y = (y * y) % mod;
    exponent /= B_TWO; // exponent /= 2
  }

  return (x % mod);
}

bool isMillerRabinTestOk(BigInt candidate) {
  if (candidate < B_TWO)
    return false;

  if (candidate != B_TWO && isEven(candidate))
    return false;

  BigInt canditateMinusOne = candidate - B_ONE;
  BigInt s = canditateMinusOne;
  while (isEven(s))
    s /= B_TWO;

  for (int i = 0; i < MILLER_RABIN_TEST_ITERATIONS; i++) {
    BigInt a = BigInt(rand()) % canditateMinusOne + B_ONE;
    BigInt temp = s;
    BigInt mod = modularExponentiation(a, temp, candidate);

    while (temp != canditateMinusOne && mod != B_ONE && mod != canditateMinusOne) {
      mod = (mod * mod) % candidate;
      temp *= B_TWO;
    }

    if (mod != canditateMinusOne && isEven(temp))
      return false;
  }

  return true;
}
  • The problem is that multiple threads can have the same sum `i + j` and therefore there is a race condition on `stack`. You could use a reduction clause (each thread will fill its own version of stack and later do a reduction), but as collisions are somewhat seldom, it might be better to use atomics (needs to be benchmarked; see the sketch after these comments). As I would expect the sizes of the vectors to be relatively small, a `collapse` clause might also help, so there is more parallelism. On the other hand, due to characters being small, one might want to use SIMD parallelism on the inner loop instead. – paleonix Apr 19 '22 at 11:55
  • Take a look at the [Examples](https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf) (Chapter 5 on SIMD, 1.7 on the collapse clause, 6.4 for atomics, 7.9 reduction clause). For doing a reduction on a full array see [here](https://stackoverflow.com/q/20413995/10107454). – paleonix Apr 19 '22 at 12:00
  • Note that using `int` is not great for vectorisation using SIMD instructions. Indeed, the larger the types, the smaller the number of lanes, the lower the performance. Additionally note that most compilers use SSE on x86 systems by default unless you manually enable AVX. This means your code can possibly be 8 times faster with SIMD instructions. I do not expect compilers to vectorize it efficiently, though, because of the integer product. Using intrinsics will certainly help on x86-64 platforms. – Jérôme Richard Apr 19 '22 at 23:07
  • Somewhat off-topic: your second loop over `stack` goes too far, you are accessing `i + 1`, so the loop condition has to be `i < stack.size() - 1`. I am also not sure about the size of `stack` itself. For the first loop you only need `a.size() + b.size() - 1` elements. If I understand correctly, you want one element more for the carry, so I would think that you are allocating one element more than needed. If the number of elements for your big integers is static, I would use `std::array` instead of vectors. – paleonix Apr 20 '22 at 07:46
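
For illustration, here is a minimal sketch of the atomics variant described in the first comment above, assuming the question's `charToInt` helper and vector typedefs; whether it beats a per-thread reduction needs benchmarking:

// Sketch only: an atomic update removes the race on stack[i + j],
// and collapse(2) exposes all i/j pairs as a single parallel loop.
#pragma omp parallel for collapse(2)
for (size_t i = 0; i < a.size(); i++)
  for (size_t j = 0; j < b.size(); j++) {
    int prod = charToInt(a[i]) * charToInt(b[j]);
    #pragma omp atomic
    stack[i + j] += prod;
  }

With OpenMP 4.5+ the per-thread-copy alternative can be written as an array-section reduction over the raw buffer, e.g. `reduction(+ : p[:n])` with `int* p = stack.data();`, trading atomic traffic for a combine step at the end.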

1 Answer


Your loops do not have the proper structure for parallelization. However, you can transform them:

for (int k = 0; k < a.size() + b.size(); k++) {
  for (int i = 0; i < a.size(); i++) {
    int j = k - i;
    stack[k] += a[i] * b[j];
  }
}

Now the outer loop has no conflicts. Look at this as a "coordinate transformation": you're still traversing the same i/j row/column space, but now in new coordinates: k/i stands for diagonal/row.

Btw, this code is a little metaphorical. Check your loop bounds, and use the right multiplication. I'm just indicating the principle here.
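
For concreteness, here is a sketch of the transformed loop with explicit bounds (so `j = k - i` never leaves `b`), assuming the question's `charToInt` and digit layout plus `<algorithm>` for `std::min`/`std::max`:

// Each k is one anti-diagonal (i + j == k), so iterations over k are
// independent. Clamping i keeps j = k - i inside [0, b.size() - 1]
// without a branch in the inner loop.
#pragma omp parallel for schedule(guided)
for (int k = 0; k < (int)(a.size() + b.size()) - 1; k++) {
  int ilo = std::max(0, k - (int)b.size() + 1);
  int ihi = std::min((int)a.size() - 1, k);
  for (int i = ilo; i <= ihi; i++)
    stack[k] += charToInt(a[i]) * charToInt(b[k - i]);
}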

Victor Eijkhout
  • I get the idea, but now the j index goes out of bounds: when k is bigger than b.size() and i is, for example, 0, then j is bigger than b.size(). – ПВ201 Николотов Александр Apr 19 '22 at 10:29
  • It can be fixed using a conditional `if (j >= 0) { stack[k] += ...; }`, but I don't think that this will be the most performant solution (it might be, always benchmark). – paleonix Apr 19 '22 at 12:02
  • Due to the work imbalance between threads, you will want a `schedule(dynamic)` clause with it. `guided` scheduling might also work. The default `static` scheduling is bad when working with work imbalance. – paleonix Apr 19 '22 at 12:06
  • If I add the conditional `if (j >= 0 && j < b.size())` it starts working, but a benchmark shows that my program runs 20x slower. (The same algorithm without `parallel for` works fine.) I'm so confused... – ПВ201 Николотов Александр Apr 19 '22 at 12:44
  • Could you add the new code fragment to your question? Saying "it doesn't work" gives us nothing to work with. Also: how big are your integers? For loops <100 elements I wouldn't expect much improvement. – Victor Eijkhout Apr 19 '22 at 13:38
  • Thanks again for helping me. I've edited my question (added the new code). My integers are 128 digits (I'm generating large prime numbers for the RSA algorithm and doing the Miller-Rabin test). I thought that if I parallelized multiplication (it's the main operation in my whole project) it would work much faster. – ПВ201 Николотов Александр Apr 19 '22 at 15:13
  • The `dynamic` schedule has a considerable runtime overhead. I would first try `guided`. Also: to prevent cacheline issues I would chunk it: `guided,8`. – Victor Eijkhout Apr 19 '22 at 15:21
  • Just tried `guided,8`. Even slower somehow... Maybe the multiplication algorithm is not that well suited to parallelizing, what do you think? If so, maybe you could give me advice on where I should try to add parallelization? The Miller-Rabin test is the slowest part of my program. I've added its code to the question. – ПВ201 Николотов Александр Apr 19 '22 at 15:43
  • Looking at your code I find that I overlooked the fact that you have vectors of `char`s, not `int`s. So a cacheline is 64 elements, and you'd have to use `guided,64` to prevent false sharing of cachelines. But that takes away almost all parallelism. Conclusion: your problem is too small to parallelize on that low level. You need to look higher up for parallelism. Why not make the test iterations parallel (see the sketch after these comments)? You need to worry a little about the random number generator, but the C++ one is threadsafe. – Victor Eijkhout Apr 19 '22 at 17:04
  • Understood. I'm going to look for a better place to implement parallelism. Thank you! – ПВ201 Николотов Александр Apr 19 '22 at 17:57
  • You should probably still make sure that your compiler uses SIMD here. The overhead should be way smaller. – paleonix Apr 20 '22 at 07:24
  • @paleonix Note that his scalar multiplication is something complicated, so SIMD might not be immediately applicable. – Victor Eijkhout Apr 20 '22 at 12:39
  • @VictorEijkhout When adding `-march=skylake` in gcc11, it will be vectorized. On my machine I see a performance increase of 1.7x using google-benchmark. I guess going from `char` to packed `int` takes some performance away, but 1.7x is still very significant. – paleonix Apr 27 '22 at 21:08