
I have some simple code that does the following.

It iterates over all possible length-n lists S with +-1 entries and doubles each one to form a length-2n list whose second half is a copy of the first. For each S, it then iterates over all possible length-n lists F with +-1 entries and computes the inner product of F with each of the n+1 length-n sublists of the doubled list. For each pair F, S it counts the inner products that are zero until the first non-zero inner product.

Here is the code.

#!/usr/bin/python

from __future__ import division
import itertools
import operator
import math

n=14
m=n+1
def innerproduct(A, B):
    assert (len(A) == len(B))
    s = 0 
    for k in xrange(0,n):
        s+=A[k]*B[k]
    return s

leadingzerocounts = [0]*m
for S in itertools.product([-1,1], repeat = n):
    S1 = S + S  # doubling S makes S1[i:i+n] the i-th cyclic shift of S
    for F in itertools.product([-1,1], repeat = n):
        i = 0
        # count zero inner products until the first non-zero one
        while (i<m):
            ip = innerproduct(F, S1[i:i+n])
            if (ip == 0):
                leadingzerocounts[i] +=1
                i+=1
            else:
                break

print leadingzerocounts

The correct output for n=14 is

[56229888, 23557248, 9903104, 4160640, 1758240, 755392, 344800, 172320, 101312, 75776, 65696, 61216, 59200, 59200, 59200]

Using pypy, this takes 1 min 18 seconds for n = 14. Unfortunately, I would really like to run it for n = 16, 18, 20, 22, 24 and 26. I don't mind using numba or cython, but I would like to stay close to Python if at all possible.

Any help speeding this up is very much appreciated.


I'll keep a record of the fastest solutions here. (Please let me know if I miss an updated answer.)

  • n = 22 at 9m35.081s by Eisenstat (C)
  • n = 18 at 1m16.344s by Eisenstat (pypy)
  • n = 18 at 2m54.998s by Tupteq (pypy)
  • n = 14 at 26s by Neil (numpy)
  • n = 14 at 11m59.192s by kslote1 (pypy)
  • Have you tried using a Numpy multidimensional array? – tom Nov 19 '14 at 14:59
  • Might not get a chance to add the code, but noting that `IP(A,B) = IP(A[:n/2 + 1], B[:n/2 + 1]) + IP(A[n/2 + 1:], B[n/2 + 1:])` allows for some improvement based on a technique similar to the one used for [subset sum](http://en.wikipedia.org/wiki/Subset_sum_problem#Exponential_time_algorithm). This should allow for an `O(2^N)` algorithm rather than `O(2^(2N))`, though it might require `O(2^N)` space. It makes use of finding all IPs for pairs of size `N/2` (of which there are `O(2^N)`), then using these to build up the solution set. A graph can be used to handle the state transitions found in the while loop. – Nuclearman Nov 23 '14 at 09:28
  • After a bit of testing, the approach above may not be practical. The issue is that handling the state transitions seems to require branching, which reintroduces numbers that had previously been eliminated, as well as duplicates. Basically, the algorithm I wrote gives incorrect counts past the second one (i=2 and beyond), and simply removing duplicates isn't enough to fix it, though it helps greatly. This suggests the approach is probably flawed, as far as getting O(2^N) space/time performance goes. – Nuclearman Nov 29 '14 at 01:47
  • @Nuclearman I find this surprising, I have to admit. – Simd Nov 29 '14 at 08:17
  • You're free to try it yourself anyway. The IP matching part is straightforward enough and is very fast for getting the first count (see the sketch below). It's the batch handling of the shifts that I was unable to get right, and I question whether it is possible. I probably won't try to implement a correct version of the algorithm, as without it being `O(2^N)`, which I find unlikely, there's a fair chance it won't be better than David Eisenstat's answer. – Nuclearman Nov 30 '14 at 01:13
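For reference, here is a minimal sketch of the fast first-count part of that meet-in-the-middle idea (an illustration based on the comments above, not Nuclearman's actual code). The distribution of half inner products over all pairs of half-vectors lets you count the shift-0 zeros with roughly O(2^n) table work instead of checking all O(2^(2n)) pairs; the hard part, per the comments, is extending this to the shifted counts.

# Sketch only: meet-in-the-middle count of pairs (F, S) with <F, S> = 0,
# i.e. leadingzerocounts[0]. The shift handling is the part that remains hard.
from collections import Counter
from itertools import product

n = 14
h = n // 2

# Distribution of <a, b> over all pairs of half-vectors a, b in {-1, 1}^h.
half = Counter()
for a in product([-1, 1], repeat=h):
    for b in product([-1, 1], repeat=h):
        half[sum(x * y for x, y in zip(a, b))] += 1

# <F, S> = 0 exactly when the two half inner products cancel.
zero_pairs = sum(half[t] * half[-t] for t in half)
print(zero_pairs)  # 56229888 for n = 14, matching leadingzerocounts[0]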

5 Answers


This new code gets another order of magnitude speedup by taking advantage of the cyclic symmetry of the problem. This Python version enumerates necklaces with Duval's algorithm; the C version uses brute force. Both incorporate the speedups described below. On my machine, the C version solves n = 20 in 100 seconds! A back-of-the-envelope calculation suggests that, if you were to let it run for a week on a single core, it could do n = 26, and, as noted below, it's amenable to parallelism.

import itertools


def necklaces_with_multiplicity(n):
    # Enumerate +/-1 necklaces of length n in the style of the
    # Fredricksen-Kessler-Maiorana (Duval) algorithm. Whenever n % i == 0,
    # w consists of n/i copies of a Lyndon prefix of length i, so the
    # necklace has exactly i distinct rotations; that is its multiplicity.
    assert isinstance(n, int)
    assert n > 0
    w = [1] * n
    i = 1
    while True:
        if n % i == 0:
            s = sum(w)
            if s > 0:
                # The negation (sum < 0) is never yielded, so double up.
                yield (tuple(w), i * 2)
            elif s == 0:
                yield (tuple(w), i)
        i = n - 1
        while w[i] == -1:
            if i == 0:
                return
            i -= 1
        w[i] = -1
        i += 1
        for j in range(n - i):
            w[i + j] = w[j]


def leading_zero_counts(n):
    assert isinstance(n, int)
    assert n > 0
    assert n % 2 == 0
    counts = [0] * n
    necklaces = list(necklaces_with_multiplicity(n))
    # Build w from v by flipping exactly n/2 of the first n - 1 signs, so
    # that <v, w> = 0 at shift 0 by construction; restricting the flips to
    # the first n - 1 positions pairs each w with -w, hence the factor 2.
    for combo in itertools.combinations(range(n - 1), n // 2):
        for v, multiplicity in necklaces:
            w = list(v)
            for j in combo:
                w[j] *= -1
            for i in range(n):
                counts[i] += multiplicity * 2
                # Inner product of w with v rotated by i + 1 positions.
                product = 0
                for j in range(n):
                    product += v[j - (i + 1)] * w[j]
                if product != 0:
                    break
    return counts


if __name__ == '__main__':
    print(leading_zero_counts(12))

C version:

#include <stdio.h>

enum {
  N = 14
};

struct Necklace {
  unsigned int v;
  int multiplicity;
};

static struct Necklace g_necklace[1 << (N - 1)];
static int g_necklace_count;

static void initialize_necklace(void) {
  g_necklace_count = 0;
  for (unsigned int v = 0; v < (1U << (N - 1)); v++) {
    int multiplicity;
    unsigned int w = v;
    for (multiplicity = 2; multiplicity < 2 * N; multiplicity += 2) {
      /* Cyclically rotate w right by one bit. */
      w = ((w & 1) << (N - 1)) | (w >> 1);
      /* x is w with all signs flipped (bitwise complement in N bits). */
      unsigned int x = w ^ ((1U << N) - 1);
      /* Keep v only if it is the smallest among its rotations and their
         complements; otherwise this necklace is represented elsewhere. */
      if (w < v || x < v) goto nope;
      if (w == v || x == v) break;
    }
    g_necklace[g_necklace_count].v = v;
    g_necklace[g_necklace_count].multiplicity = multiplicity;
    g_necklace_count++;
   nope:
    ;
  }
}
}

int main(void) {
  initialize_necklace();
  long long leading_zero_count[N + 1];
  for (int i = 0; i < N + 1; i++) leading_zero_count[i] = 0;
  for (unsigned int v_xor_w = 0; v_xor_w < (1U << (N - 1)); v_xor_w++) {
    /* <v, w> = 0 exactly when v and w differ in N/2 of the N positions. */
    if (__builtin_popcount(v_xor_w) != N / 2) continue;
    for (int k = 0; k < g_necklace_count; k++) {
      unsigned int v = g_necklace[k].v;
      unsigned int w = v ^ v_xor_w;
      for (int i = 0; i < N + 1; i++) {
        leading_zero_count[i] += g_necklace[k].multiplicity;
        w = ((w & 1) << (N - 1)) | (w >> 1);  /* next cyclic shift of w */
        if (__builtin_popcount(v ^ w) != N / 2) break;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) {
    printf(" %lld", 2 * leading_zero_count[i]);
  }
  putchar('\n');
  return 0;
}

You can get a bit of speedup by exploiting the sign symmetry (a factor of 4) and by iterating over only those vectors F that pass the first inner product test (asymptotically, a factor of O(sqrt(n))).

import itertools


n = 10
m = n + 1


def innerproduct(A, B):
    s = 0
    for k in range(n):
        s += A[k] * B[k]
    return s


leadingzerocounts = [0] * m
# Fix the last element of S to +1 (S -> -S symmetry) and build F by flipping
# exactly n/2 of the first n - 1 signs of S1, so the shift-0 inner product is
# zero by construction; each surviving pair therefore stands for 4 pairs.
for S in itertools.product([-1, 1], repeat=n - 1):
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if innerproduct(F, S1S1[i:i + n]):
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

C version, to get a feel for how much performance we're losing to PyPy (n = 16 under PyPy takes roughly as long as n = 18 in C):

#include <stdio.h>

enum {
  HALFN = 9,
  N = 2 * HALFN
};

int main(void) {
  long long lzc[N + 1];
  for (int i = 0; i < N + 1; i++) lzc[i] = 0;
  /* xor ranges over f ^ s; popcount N/2 means <f, s> = 0 at shift 0. */
  unsigned int xor = 1 << (N - 1);
  while (xor-- > 0) {
    if (__builtin_popcount(xor) != HALFN) continue;
    unsigned int s = 1 << (N - 1);
    while (s-- > 0) {
      lzc[0]++;
      unsigned int f = xor ^ s;
      for (int i = 1; i < N + 1; i++) {
        f = ((f & 1) << (N - 1)) | (f >> 1);  /* next cyclic shift of f */
        if (__builtin_popcount(f ^ s) != HALFN) break;
        lzc[i]++;
      }
    }
  }
    }
  }
  for (int i = 0; i < N + 1; i++) printf(" %lld", 4 * lzc[i]);
  putchar('\n');
  return 0;
}

This algorithm is embarrassingly parallel because it's just accumulating over all values of xor. With the C version, a back-of-the-envelope calculation suggests that a few thousand hours of CPU time would suffice to calculate n = 26, which works out to a couple hundred dollars at current rates on EC2. There are undoubtedly some optimizations to be made (e.g., vectorization), but for a one-off like this I'm not sure how much more programmer effort is worthwhile.
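As a rough illustration of that parallelism (a sketch following the structure of the Python version above, not part of the measured results), the outer loop can be split across worker processes with multiprocessing, with each worker returning partial counts that are summed at the end:

# Parallel sketch: distribute the outer S loop of the sign-symmetry version
# across processes and sum the per-worker partial counts.
import itertools
from multiprocessing import Pool

n = 10
m = n + 1

def counts_for_S(S):
    # Partial leading-zero counts contributed by one outer vector S.
    local = [0] * m
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        local[0] += 4
        for i in range(1, m):
            if sum(F[k] * S1S1[i + k] for k in range(n)):
                break
            local[i] += 4
    return local

if __name__ == '__main__':
    with Pool() as pool:
        partials = pool.map(counts_for_S,
                            itertools.product([-1, 1], repeat=n - 1))
    print([sum(col) for col in zip(*partials)])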

David Eisenstat

One very simple speed up of a factor of n is to change this code:

def innerproduct(A, B):
    assert (len(A) == len(B))
    for j in xrange(len(A)):
        s = 0 
        for k in xrange(0,n):
            s+=A[k]*B[k]
    return s

to

def innerproduct(A, B):
    assert (len(A) == len(B))
    s = 0 
    for k in xrange(0,n):
        s+=A[k]*B[k]
    return s

(I don't know why you have the loop over j, but it just repeats the same calculation each time, so it is unnecessary.)

Peter de Rivaz
  • Thanks, that was just a bug! As you answered so quickly, I will just fix the question if you don't mind. – Simd Nov 16 '14 at 19:26

I have had a go at transferring this into NumPy arrays and borrowed from this question: itertools product speed up

This is what I have got (there may be more speed-ups to be had here):

def find_leading_zeros(n):
    if n % 2:
        return numpy.zeros(n)
    m = n + 1
    leading_zero_counts = numpy.zeros(m)
    product_list = [-1, 1]
    repeat = n
    # All 2^n sign vectors as the rows of an int8 array.
    s = (numpy.array(product_list)[numpy.rollaxis(numpy.indices((len(product_list),) * repeat),
                                                  0, repeat + 1).reshape(-1, repeat)]).astype('int8')
    i = 0
    size = s.shape[0] // 2  # by symmetry only half the rows are needed
    products = numpy.zeros((size, size), dtype=bool)
    while i < m:
        # tensordot gives the inner products at shift i for all (F, S) pairs;
        # += on the boolean array acts as logical OR, so products remembers
        # whether any shift <= i was non-zero.
        products += (numpy.tensordot(s[:size],
                                     numpy.roll(s, i, axis=1)[:size],
                                     axes=(-1, -1))).astype('bool')
        leading_zero_counts[i] = (products.size - numpy.sum(products)) * 4
        i += 1

    return leading_zero_counts

Running for n=14 I get:

>>> find_leading_zeros(14)
array([ 56229888.,  23557248.,   9903104.,   4160640.,   1758240.,
        755392.,    344800.,    172320.,    101312.,     75776.,
        65696.,     61216.,     59200.,     59200.,     59200.])

So all looks good. As for speed:

>>> timeit.timeit("find_leading_zeros_old(10)", number=10)
28.775046825408936
>>> timeit.timeit("find_leading_zeros(10)", number=10)
2.236745834350586

See what you think.

EDIT:

The original version used 2074MB of memory for n=14, so I have removed the concatenated array and used numpy.roll instead. Also, changing the data types to use a boolean array brings the memory down to 277MB for n=14.

Time-wise, this edit is a little faster again:

>>> timeit.timeit("find_leading_zeros(10)", number=10)
1.3816070556640625

EDIT2:

OK, so adding in the symmetry pointed out by David reduces this again. It now uses 213MB. Timing compared to the previous edits:

>>> timeit.timeit("find_leading_zeros(10)", number=10)
0.35357093811035156 

I can now do the n=14 case in 14 seconds on my MacBook, which is not bad for "pure Python", I think.

Neil Parley

I tried to speed this up and failed badly :( But I'm posting the code anyway; it's somewhat faster, but still not fast enough for values like n=24.

My assumptions

Your lists consist of +-1 values, so I decided to use numbers instead of lists: each bit represents one of the possible values (a set bit means 1, a cleared bit means -1). The only possible results of multiplying values from {-1, 1} are 1 and -1, so I used bitwise XOR instead of multiplication. I also noticed that there's a symmetry, so you only need to check a subset (one fourth) of the possible lists and multiply the result by 4 (David explained this in his answer).

Finally, I put the results of the possible operations into tables to eliminate the need to recompute them. This takes a lot of memory, but who cares (for n=24 it was about 150MB)?
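As a quick, self-contained check of the encoding (my example, not from the answer): with a set bit standing for 1 and a cleared bit for -1, the inner product of two +-1 vectors equals n minus twice the popcount of the XOR of their encodings, which is exactly the quantity the lookup table precomputes.

# Tiny check of the bit encoding: the +/-1 inner product equals
# n minus twice the number of bit positions where a and b differ.
n = 4
a, b = 0b1010, 0b0110  # two arbitrary encoded vectors, bits read LSB-first

def vec(x):
    return [1 if (x >> k) & 1 else -1 for k in range(n)]

ip = sum(p * q for p, q in zip(vec(a), vec(b)))
assert ip == n - 2 * bin(a ^ b).count('1')
print(ip)  # 0 for this particular pair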

And then @David Eisenstat answered the question :) So, I took his code and modified it to be bit-based. It's about 2-3 times faster (for n=16 it took ~30 seconds, compared to ~90 seconds for David's solution), but I think it's still not enough to get results for n=26 or so.

import itertools

n = 16
m = n + 1
mask = (2 ** n) - 1

# Create table of sum results (replaces innerproduct())
tab = []
for a in range(2 ** n):
    s = 0
    for k in range(n):
        s += -1 if a & 1 else 1
        a >>= 1
    tab.append(s)

# Create combination bit masks for combinations
comb = []
for C in itertools.combinations(range(n - 1), n // 2):
    xor = 0
    for i in C:
        xor |= (1 << i)
    comb.append(xor)

leadingzerocounts = [0] * m
for S in range(2 ** (n-1)):
    S1 = S + (1 << (n-1))
    S1S1 = S1 + (S1 << n)

    for xor in comb:
        F = S1 ^ xor

        leadingzerocounts[0] += 4
        for i in range(1, m):
            if tab[F ^ ((S1S1 >> i) & mask)]:
                break
            leadingzerocounts[i] += 4

print(leadingzerocounts)

Conclusions

I thought I had invented something brilliant and hoped that all this messing with bits would give a great speed boost, but the boost was disappointingly small :(

I think the reason is the way Python uses operators: it calls a function for every arithmetic (or logical) operation, even when it could be done with a single assembler instruction (I hoped PyPy would be able to simplify operations to that level, but it didn't). So, if C (or ASM) were used with this bit-operating solution, it would probably perform great (maybe you could get to n=24).
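If you want to see that per-operation dispatch cost directly, a rough timeit harness like this (illustrative only; the values are arbitrary) times one XOR and one XOR-plus-table-lookup step in isolation:

# Rough harness: time a single XOR and a single XOR-plus-lookup in isolation.
import timeit

setup = 'f = 0b1010101010101010; s = 0b0110011001100110; tab = [0] * (1 << 16)'
print(timeit.timeit('f ^ s', setup=setup, number=10**6))
print(timeit.timeit('tab[f ^ s]', setup=setup, number=10**6))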

Tupteq
  • Dropping all the way to C didn't have much additional impact (see my edit). The problem is that the amount of work is growing by about a factor of 16 every time that n is increased by 2. – David Eisenstat Nov 20 '14 at 15:31
  • So, with C code you could get farther. Maybe to n=22 or 24. – Tupteq Nov 20 '14 at 16:16
  • I managed to do n = 18 with help from pypy and your code. Thank you. – Simd Nov 20 '14 at 19:26

In my opinion, a good way to get a performance boost is to use the Python builtins.

First, use map to calculate the products of the entries:

>>> a =[1,2,3]
>>> b = [4,5,6]
>>> map(lambda x, y: x * y, a, b)
[4, 10, 18]

Then use reduce to compute the sum:

>>> reduce(lambda v,w: v+w, map(lambda x,y :x*y, a, b))
32

So then your function becomes

def innerproduct(A, B):
    assert (len(A) == len(B))
    return reduce(lambda v,w: v+w, map(lambda x,y :x*y, A, B))
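For what it's worth, the same builtins-only idea can be written with operator.mul and sum (a variant of mine, assuming Python 2 as in the question), which avoids the lambdas entirely:

import operator

def innerproduct(A, B):
    # Elementwise products via map/operator.mul, then a builtin sum.
    assert (len(A) == len(B))
    return sum(map(operator.mul, A, B))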

Next, we can take all of those for loops out, replace them with generators, and catch the StopIteration.

#!/usr/bin/python

from __future__ import division
import itertools
import operator
import math

n = 14
m = n + 1

def innerproduct(A, B):
    assert (len(A) == len(B))
    return reduce(lambda v, w: v + w, map(lambda x, y: x * y, A, B))


leadingzerocounts = [0] * m

S_gen = itertools.product([-1, 1], repeat=n)

try:
    while True:
        S = S_gen.next()
        S1 = S + S
        F_gen = itertools.product([-1, 1], repeat=n)
        try:
            while True:
                F = F_gen.next()
                for i in xrange(m):
                    ip = innerproduct(F, S1[i:i + n])
                    if ip == 0:
                        leadingzerocounts[i] += 1
                    else:
                        break
        except StopIteration:
            pass

except StopIteration as e:
    print e

print leadingzerocounts

I observed a speed-up for smaller n, but my jalopy lacked the horsepower to run either my version or the original code for n=14. One way to speed this up further would be to memoize the line:

    F_gen = itertools.product([-1,1], repeat = n)
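A minimal sketch of that memoization (my illustration, not kslote1's code): build the list of F vectors once, outside the S loop, instead of constructing a fresh generator for every S.

import itertools

n = 14
m = n + 1

# Build the 2^n F vectors once and reuse them for every S, rather than
# recreating the itertools.product generator inside the outer loop.
all_F = list(itertools.product([-1, 1], repeat=n))

leadingzerocounts = [0] * m
for S in itertools.product([-1, 1], repeat=n):
    S1 = S + S
    for F in all_F:
        for i in range(m):
            if sum(F[k] * S1[i + k] for k in range(n)):
                break
            leadingzerocounts[i] += 1

print(leadingzerocounts)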
kslote1
  • Thanks for this. Your code as-is is unfortunately quite slow for n = 14, as you suggested. – Simd Nov 20 '14 at 21:12