Most efficient way to evaluate a binary scalar product mod 2

Question

I am currently performing Fourier transforms for some physics problem, and a huge bottleneck of my algorithm comes from the evaluation of a scalar product modulo 2.

For a given integer N, I have to represent all the numbers in binary up to 2^N-1.

For each of these numbers, represented as a binary vector (e.g. 15 = 2^3 + 2^2 +2+2^0 = (1,1,1,1,0,...,0)) I have to evaluate its scalar products with all numbers from 0 to 2^N-1 in binary form modulo 2.

(for example, the scalar product 1.15 =(1,0,0,...,0).(1,1,1,1,0,...,0)=1*1+1*0+...=1 mod 2)

Note that the components are kept in binary form during the reducing modulo 2

(1,1).(1,1)=1*1+1*1 and not 1*1+2*2

This is basically 2^(2N) scalar products that I have to perform and reduce modulo 2.

I am having difficulty to get more than N = 18.

I was wondering whether some clever mathematical trick can be used to greatly reduce the time spent doing them.

I was thinking of some kind of recursion (i.e. saving results for N in a file and deduce the results for N+1) but I am not sure this would help. Indeed, with this recursion, knowing the results for N, I could cut the vector for N+1 corresponding to the N part plus an additional digit, but then at each scalar product, instead of evaluating the scalar product, I would have to tell my computer to go and read a big file (because I probably wouldn't be able to keep it all in dynamic memory), which is probably time-consuming, perhaps more than the ~20 multiplications I have to perform for each of the products.

Is there any known optimized number-theoretical algorithm allowing the evaluation of such a scalar product modulo 2 very quickly ? Are there any rules or ideas I am not aware of that I could exploit ?

Sorry for the terrible formatting, I just can't get LateX to work in here.

You could just compute the [bitwise-AND](https://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#and(java.math.BigInteger)) of the two numbers and then check the parity (see if the [number of 1-bit](https://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#bitCount()) is odd). — kennytm, Apr 08 '17 at 21:23
How are they used, perhaps the number of evaluations can be reduced too? And do you want only high level suggestions or also implementation strategies for specific architectures (which?)? Btw Latex is not enabled on SO. — harold, Apr 08 '17 at 21:35
@harold I am evaluating the c_k times exponential of i*k*r with k = i*pi times a binary vector n, and r is also a binary vector, so evaluating n.r allows me to know whether I should subtract or add some term in my fourier summation which sums over k, and I want it for all r. The c_k are very specific and I don't think there are many useful patterns in them. I don't know what "high-level" means, but I am mostly interested in reducing the complexity rather than making minor changes in syntax to speed it up (as I am pretty sure the improvements in this case will be... well... minor) — Evariste, Apr 08 '17 at 21:42
@trincot I am ideally trying to reach around N = 24. I doubt it is realistic to hope for much more — Evariste, Apr 08 '17 at 21:43
You're not using the butterfly structure then? It has a more regular +/- structure. Also I wouldn't discount low level optimization just yet, for example in this case we might use the parity flag (not exposed in high level programming languages) or `popcnt` (rarely exposed), otherwise it'll just be a XOR reduction which is only log(n) layers deep anyway so not *that* big a deal. — harold, Apr 08 '17 at 21:52
.. for example in C++ you can use `_mm_popcnt_u32(a & b) & 1`, if you're on the right platform. — harold, Apr 08 '17 at 22:16

score 4 · Answer 1 · edited Jan 09 '20 at 02:07

4

The sum of the product of corresponding bits, modulo 2, will be equal to the number of 1 bits in the AND of the two numbers, modulo 2.

As you can get the binary representation of a number easily, it might not be necessary to actually create an array of bits for them, but just use the integer data type in your programming language, which allows for at least 32 bits. Many languages offer bit operators, such as a AND (&) and XOR (^).

Counting the 1 bits in a number can be done with the variable-precision SWAR algorithm.

Here is program in Python that calculates this product modulo 2 for 2 numbers:

def numberOfSetBits(i):
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;

def product(a, b):
    return numberOfSetBits(a & b) % 2

Instead of counting the bits with numberOfSetBits, you could fold the bits together with XORs, first the 16 most significant bits with the 16 least significant bits, then of that result the 8 most significant with the 8 least significant bits, until you have one bit left. Again in Python:

def bitParity(i):
    i = (i >> 16) ^ i
    i = (i >> 8) ^ i
    i = (i >> 4) ^ i
    i = (i >> 2) ^ i
    i = (i >> 1) ^ i
    return i % 2

def product(a, b):
    return bitParity(a & b)

edited Jan 09 '20 at 02:07

phuclv

37,963
15
156
475

answered Apr 08 '17 at 21:24

trincot

317,000
35
244
286

Sorry I guess I wasn't clear enough, I will edit further : the mod 2 applies to the summation of the components kept in binary form ! (i.e. 15.5=(1,1,1,1,0,..)*(1,1,0,...)=1+1=2=0 mod 2) – Evariste Apr 08 '17 at 21:29
Could you please write out a complete "product" (that apparently is a sum)? Why is that 1+1? – trincot Apr 08 '17 at 21:32
What I am trying to achieve is a scalar product with vectors obtained from regular integers. For example, for 15, I get (1,1,1,1,0...), and that's it. I never change back to decimal. Then I use the usual dot product rules x1*x1+x2*x2+... with x1, x2, ... = 0 or 1 always, then I reduced mod 2, so 1.15=1*1+0*1+0*1+0*1+0*0+...+0*0=1=1 mod 2 – Evariste Apr 08 '17 at 21:34
@dfri, *N* refers to the number of bits. Anyway I had just revised my answer to deal with the clarified definition of *product* in this question. – trincot Apr 08 '17 at 21:53
@trincot I had to re-read the question another two times to now realize this fact, thanks. – dfrib Apr 08 '17 at 21:55
Interesting approach, thanks for this, although this alone probably won't make me reach anything above what I already have (testing with python yields a slower algorithm than my naive one, but I code in C++, so maybe it has something to do with this). – Evariste Apr 08 '17 at 22:08
Yes, Python probably cannot compete with C++. I also had to make some corrections to my answer. – trincot Apr 08 '17 at 22:10

score 1 · Answer 2 · edited Jan 09 '20 at 02:06

If you change the order that you are evaluating these pairs (a matrix of size 2ⁿ x 2ⁿ), then you can efficiently figure out which products-mod-2 change in each row of your evaluation.

Using Gray code, you can iterate over each value from 0 ... 2ⁿ-1 in a special order where only 1 bit of the outer-loop value changes each time. You can store 1 bit for each value from 0 ... 2ⁿ-1 representing the previous row's product-mod-2 values, and then change it based on whether the changing bit has any effect, which it only does when the corresponding bit in the other (inner loop) number is 1 (if it's 0 then the binary AND will be 0 no matter what the value of the other bit).

In C:

int N = 5;
int max = (1 << N) - 1;
unsigned char* prev = calloc((1 << N) / 8, 1);

// for the first row all the products will be zero, so start at row 1
for(int a = 1; a <= max; a++)
{
    int grey = a ^ (a >> 1);    // compute the grey code
    int prev_grey = (a - 1) ^ ((a - 1) >> 1);
    int changed_bit = grey ^ prev_grey;

    for(int b = 0; b <= max; b++)
    {
        // the product will be changed only if b has a 1 at the same place 
        // (otherwise it will be 0 regardless)
        if(b & changed_bit)
        {
            prev[b >> 3] ^= (1 << (b & 7));
        }
        int mod = (prev[b >> 3] & (1 << (b & 7))) != 0;
        printf("mod value of %d and %d is %d\n", grey, b, mod);
    }   
}

The inner loop can be optimized even more because you can easily figure out which values of b have a non-zero value in the position of the changed bit: for example if it's in position 10 then there will be runs of 1024 in a row of 0 then 1 etc. So you know that you have 1024 values where the product-mod-2 is the same as in the previous row etc. It's not clear to me if this helps you though because I don't know what you are doing with these products.

The inner loop could also be unrolled (e.g. 32 or 64 times) so that you don't read and write to the prev array each time, but rather process blocks of 32 or 64 bits at a time.

Most efficient way to evaluate a binary scalar product mod 2

2 Answers2