
Given an array A of integers of size N, I want to compute

    X = ( ∏_{1 ≤ i < j ≤ N} (a_i + a_j) ) mod (10^9 + 7)

This was a problem in a past inter-college programming competition. We had to write a program that would solve up to 5 instances of this problem, with N ≤ 200,000 and each a_i ≤ 200,000, within a 20-second run-time limit. Clearly, an O(N²) solution would exceed the time limit. According to the editorial, the intended solution involved polynomial multiplication using a fast Fourier transform. I'm looking for alternative algorithms that would beat the naive O(N²) algorithm without FFT (or NTT). Are there any simple and elegant solutions to this problem?

Known facts:

The mod can be 'distributed' inside the product, because (x*y) % m = ((x%m) * (y%m)) % m.
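
For reference, here is a minimal sketch of the naive O(N²) computation using this identity (the name `naive_modpi` is just illustrative); since each factor is at most 400,000 and every partial product is reduced below p = 10^9 + 7 < 2^30, a 64-bit accumulator never overflows:

#include <cstdint>
#include <vector>

// naive O(N^2) reference: reduce after every multiply, per the identity above;
// x < p < 2^30 and (a[i]+a[j]) <= 400,000, so the 64-bit product cannot overflow
uint64_t naive_modpi(const std::vector<uint64_t> &a, uint64_t p)
{
    uint64_t x = 1;
    for (size_t i = 0; i < a.size(); i++)
        for (size_t j = i + 1; j < a.size(); j++)
            x = (x * ((a[i] + a[j]) % p)) % p;
    return x;
}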

Update: here are the input/output testcase files used during the contest; if your program passes these within 20 seconds, it will be accepted. Input: https://www.dropbox.com/s/nw81hts9rniter5/algol.in?dl=0 Output: https://www.dropbox.com/s/kpa7wit35xr4xm4/algol.out?dl=0

Mercado
  • Why do you think _Clearly, an O(N^2) solution would fail the time limit_? – Ed Heal Jan 08 '17 at 09:52
  • For more detailed explanation of inteded solution, here is the link of the full editorial: https://www.dropbox.com/s/8dwzatjdj9f0ppu/algolympics2015_editorials.pdf?dl=0 (this problem is in the last page). – Mercado Jan 08 '17 at 09:54
  • @EdHeal Well, from my experience in programming competitions, automated judges can do 10^6 operations in 1 second. An O(N^2) algorithm would take around 200,000^2 = 4 * 10^10 operations. Also, the editorial said a naive O(N^2) algorithm would time out. – Mercado Jan 08 '17 at 09:58
  • @ruakh It's outside the product. I don't know how to fix the notation, would appreciate if you help me fix it. – Mercado Jan 08 '17 at 11:42
  • @Mercado just add parentheses – Sopel Jan 08 '17 at 13:15
  • I never saw notation with 2 indexes under a single PI or SUM, so does the code translate to `for (X=1,i=1;i<=N;i++) for (j=i;j<=N;j++) X*=(a[i]+a[j]);` or to something else? You could use `Pascal's triangle` to get the polynomial and use power by squaring for the powered elements. That should give you some speedup, but I fear it is still not good enough. Anyway, I would not use FFT but NTT instead, to avoid rounding problems. For additional ideas see: [Fast bignum square](http://stackoverflow.com/q/18465326/2521214), [Modular arithmetics and NTT optimizations](http://stackoverflow.com/q/18577076/2521214) – Spektre Jan 09 '17 at 08:43
  • and [Multiply polynomials with FFT](http://stackoverflow.com/a/22474892/2521214) – Spektre Jan 09 '17 at 08:44
  • Yes, you're correct. I made a correction. The condition should actually be 1 <= i < j <= N. Hence, the loop should actually be `for (X=1,i=1;i<=N;i++) for (j=i+1;j<=N;j++) X*=(a[i]+a[j]);` Yes, the intended solution is NTT because it's under a mod. But I'm looking for a solution that involves neither NTT nor FFT. – Mercado Jan 09 '17 at 09:27
  • `each ai ≤ 200,000` natural numbers? `intended solution [to the problem inspiring this question] is NTT because it's under a mod. But I'm looking for solution that does not involve NTT nor FFT` - to the extent that the formula pictured is misleading, would be correct with just the product part? And you are looking for _o(n²)_ solutions to compute a product of up to a 21 digit number of places? – greybeard Jan 10 '17 at 00:12
  • (In your initial presentation, it was _i≤j_; in the formula depicted and in your comment above, it is _i<j_.) – greybeard Jan 10 '17 at 00:20
  • @greybeard It is i < j. I updated the formula yesterday, I apologize. What do you mean by your last question 'to the extent that the formula pictured is misleading, would be correct with just the product part? And you are looking for o(n²) solutions to compute a product of up to a 21 digit number of places'? – Mercado Jan 10 '17 at 08:41
  • @Mercado **NTT** is intended because you are using integers, not because of the `mod`, so the **NTT** prime is not your modulo value but one with a specific n-th root of unity instead!!! You just use the **NTT** for the convolution and then `modPI` its result in `O(n)` – Spektre Jan 10 '17 at 10:15
  • @Spektre Noted. – Mercado Jan 10 '17 at 11:59
  • @Mercado The only sub-quadratic alternatives to FFT-based algorithms would be the Karatsuba/Toom-Cook methods. But those run in polynomial time, not `O(N log(N))`. – Mysticial Jan 10 '17 at 18:28
  • @Spektre I don't exactly see how the NTT is applicable here. `10^9 + 7` is not going to have the appropriate roots-of-unity to get the transform length you want. And even if you pad it out, you're stuck with a prime transform length of `500000003`. So absorbing the modulus into the NTT itself isn't going to be possible. You're probably gonna need to compute the full polynomial convolution. – Mysticial Jan 10 '17 at 18:48
  • @Spektre Polynomial multiplication has the property that `(sum coefficients out) = (sum coefficients input A) * (sum coefficients input B)`. Since the sum of the coefficients is `N`, the sum of the output coefficients is `N^2`. Since `N < 200000`, `N^2 < 2^36`. Therefore the largest coefficient is < `2^36`, which fits well within the range of a `double` and is at no risk of hitting FFT round-off error. Alternatively, you can do an NTT over a suitable modulus > `2^36` if you fear FFTs that much. – Mysticial Jan 10 '17 at 18:49
  • @Mysticial The problem author used chinese remainder theorem and chose primes that have appropriate roots of unity. – Mercado Jan 11 '17 at 16:33
  • @greybeard The testcase file contains 5 instances. I never said " & solve up to 5 instance [within] 20-second" in the same line. – Mercado Jan 12 '17 at 13:29
  • Observation: with the number of elements equal to the range, no less than half of the possible values occur once, at most. Values not occurring don't contribute to (the multiplicity of) possible sum values: How to make short work of those occurring once? – greybeard Jan 13 '17 at 01:33
  • @Mysticial NTT is OK, as it is used only on the histogram (most values are equal to `1`), so there is no overflow on uniformly generated data. On the other hand, if a non-uniform distribution is used then the usable `m` is significantly lower... it can be made to work, see [edit4] in my answer – Spektre Jan 14 '17 at 13:36
  • We don't answer questions on how to make a mathematical algorithm faster. If you want to discuss math, go to Mathematics SE; if you want to make your code more efficient, go to Code Review SE. – Caleb Kleveter Jan 30 '17 at 21:31

2 Answers


I gave it some more thought, and Pascal's triangle is a no-go because it would lead to even more operations. Luckily, the mod operation can be moved inside the PI, so you do not need big ints but 64-bit arithmetic (or a 32-bit modmul) instead:

PI(ai+aj) mod p == PI((ai+aj)mod p) mod p ... 1<=i<j<=n

So the naive C++ solution is as follows (valid as written only where p < 2^16; your task requires 64-bit variables instead, which I have no simple access to):

DWORD modpi(DWORD *a,int n,DWORD p) // DWORD = 32bit unsigned int
    {
    int i,j;
    DWORD x=1;
    for (i=0;i<n;i++)       // all pairs i<j
     for (j=i+1;j<n;j++)
        {
        x*=a[i]+a[j];       // 32bit overflow unless p (and the sums) are small
        x%=p;
        }
    return x;
    }

Now the p is much bigger than max(a[i]), so you could change:

x%=p;

with:

while (x>=p) x-=p;

but on today's CPUs this is actually even slower.

Still, this approach is way too slow (~280 ms for n=10000). If we reorder the values (sort them), things suddenly get much better. Each value that occurs in the array multiple times leads to a simplification, as its partial products are almost the same. For example:

a[] = { 2,3,3,4 }
x = (2+3).(2+3).(2+4)
  . (3+3).(3+4)
  . (3+4)

The terms for the two 3s are almost the same, so we can exploit that: count how many copies of the same a[i] there are, then compute the partial PI for a single one of them, raise it to that count, and for each repeated instance also multiply by the appropriate power of (a[i]+a[i]). Here is a C++ example:

DWORD modpi1(DWORD *a,int n,DWORD p)
    {
    int i,j,k;
    DWORD x,y,z;
    sort_asc(a,n);
    for (x=1,i=0;i<n;i++)
        {
        // count the values in k
        for (k=1;(i+1<n)&&(a[i]==a[i+1]);i++,k++);
        // compute partial modPI y
        for (y=1,j=i+1;j<n;j++)
            {
            y*=a[i]+a[j];
            y%=p;
            }
        // add the partial modPI y^k;
        for (j=0;j<k;j++)
            {
            x*=y;
            x%=p;
            }
        // add powers for instances of a[i]
        for (k--;k;k--)
         for (j=0;j<k;j++)
            {
            x*=a[i]+a[i];
            x%=p;
            }
        }
    return x;
    }

This gives you some speedup for each repeated value in the array. But as your array is as big as the range of numbers in it, do not expect too much from this. For uniformly random data where max(a[i]) ~= n, with quick sort the speedup is a little less than 50%. But if you use prime decomposition as MSalters' answer suggests, you might get a real speedup, because then the repetition rate should be much, much higher than ~1; handling the equations would take a lot of work, though.

This code is O(N.N') where N' is count of distinct values in a[]. You can also enhance this further to O(N'.N') by:

  1. sort a[i] by bucket sort O(n) or quick sort O(n.log(n))

  2. do RLE (run length encoding) O(n)

  3. account the counts also in the partial products, O(n'.n') where n' <= n

    The prime decomposition should just change n'<= n to n' <<< n.

Here are some measurements using quick sort and a full 32-bit modmul (implemented in 32-bit x86 asm, which slows things down considerably on my compiler), on random data where max(a[i]) ~= n:

n=1000;
[   4.789 ms]0: 234954047
[   3.044 ms]1: 234954047
n=10000;
[ 510.544 ms]0: 629694784
[ 330.876 ms]1: 629694784
n=20000;
[2126.041 ms]0: 80700577
[1350.966 ms]1: 80700577

In brackets is the time in [ms]; 0: means the naive approach, 1: means sorted with partial RLE decomposition of the PI. The last value is the result for p=1000000009.

If that is still not enough, then apart from using DFT/NTT I see no other possible speedup.

[Edit1] full RLE decomposition of a[i]

//---------------------------------------------------------------------------
const DWORD p=1000000009;
const int n=10000;
const int m=n;
DWORD a[n];
//---------------------------------------------------------------------------
DWORD modmul(DWORD a,DWORD b,DWORD p)
    {
    DWORD _a,_b;
    _a=a;
    _b=b;
    asm {
        mov    eax,_a
        mov    ebx,_b
        mul    ebx   // H(edx),L(eax) = eax * ebx
        mov    ebx,p
        div    ebx   // eax = H(edx),L(eax) / ebx
        mov    _a,edx// edx = H(edx),L(eax) % ebx
        }
    return _a;
    }
//---------------------------------------------------------------------------
DWORD modpow(DWORD a,DWORD b,DWORD p)
    {   // b is not mod(p) !
    int i;
    DWORD d=1;
    for (i=0;i<32;i++)
        {
        d=modmul(d,d,p);
        if (DWORD(b&0x80000000)) d=modmul(d,a,p);
        b<<=1;
        }
    return d;
    }
//---------------------------------------------------------------------------
DWORD modpi(DWORD *a,int n,DWORD p)
    {
    int i,j,k;
    DWORD x,y;
    DWORD *value=new DWORD[n+1];// RLE value
    int   *count=new int[n+1];  // RLE count
    // O(n) bucket sort a[] -> count[] because max(a[i])<=n
    for (i=0;i<=n;i++) count[i]=0;
    for (i=0;i< n;i++) count[a[i]]++;
    // O(n) RLE packing value[n],count[n]
    for (i=0,j=0;i<=n;i++)
     if (count[i])
        {
        value[j]=    i;
        count[j]=count[i];
        j++;
        } n=j;
    // compute the whole PI to x
    for (x=1,i=0;i<n;i++)
        {
        // compute partial modPI value[i]+value[j] to y
        for (y=1,j=i+1;j<n;j++)
         for (k=0;k<count[j];k++)
          y=modmul(y,value[i]+value[j],p);
        // add the partial modPI y^count[i];
        x=modmul(x,modpow(y,count[i],p),p);
        // add powers for instances of value[i]
        for (j=0,k=1;k<count[i];k++) j+=k;
        x=modmul(x,modpow(value[i]+value[i],j,p),p);
        }
    delete[] value;
    delete[] count;
    return x;
    }
//---------------------------------------------------------------------------

This is even a bit faster, as it does the sort in O(n) and the RLE in O(n), resulting in O(N'.N'). You can take advantage of more advanced modmul/modpow routines if you have any. But for a uniform distribution of values this is still nowhere near usable speeds.
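
For example, where a 64-bit type is available, a portable modmul needs no asm at all (a sketch; the name `modmul64` is illustrative):

DWORD modmul64(DWORD a,DWORD b,DWORD p)
    {
    // portable alternative to the asm modmul above:
    // a,b < p < 2^32 so the unsigned 64-bit product a*b cannot overflow
    return DWORD(((unsigned long long)a*(unsigned long long)b)%p);
    }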

[Edit2] full RLE decomposition of a[i]+a[j]

DWORD modpi(DWORD *a,int n,DWORD p) // full RLE(a[i]+a[j]) O(n'.n') n' <= 2n
    {
    int i,j;
    DWORD x,y;
    DWORD nn=(n+1)*2;
    int   *count=new int[nn+1]; // RLE count
    // O(n^2) bucket sort a[] -> count[] because max(a[i]+a[j])<=nn
    for (i=0;i<=nn;i++) count[i]=0;
    for (i=0;i<n;i++)
     for (j=i+1;j<n;j++)
      count[a[i]+a[j]]++;
    // O(n') compute the whole PI to x
    for (x=1,y=0;y<=nn;y++)
     if (count[y])
      x=modmul(x,modpow(y,count[y],p),p);
    delete[] count;
    return x;
    }
//---------------------------------------------------------------------------

And this is even faster, nearing the desired times, but still some way off.

n=20000
[3129.710 ms]0: 675975480 // O(n^2) naive
[2094.998 ms]1: 675975480 // O(n'.n) partial RLE decomposition of a[i] , n'<= n
[2006.689 ms]2: 675975480 // O(n'.n') full RLE decomposition of a[i] , n'<= n
[ 729.983 ms]3: 675975480 // T(c0.n^2+c1.n') full RLE decomposition of a[i]+a[j] , n'<= 2n , c0 <<< c1

[Edit3] full RLE(a[i]) -> RLE(a[i]+a[j]) decomposition

I combined all the approaches above and created a much faster version. The algorithm works like this:

  1. RLE encode a[i]

    Simply create a histogram of a[i] by bucket sort in O(n) and then pack it into the coding value[n'],count[n'] so that no zeros are present in the arrays. This is pretty fast.

  2. convert RLE(a[i]) to RLE(a[i]+a[j])

    Simply count each a[i]+a[j] term of the final PI, similarly to the RLE(a[i]+a[j]) decomposition, but in O(n'.n') and without any time-demanding operations. Yes, this is quadratic, but over n' <= n and with a very small constant time. This part is still the bottleneck, though...

  3. compute the modpi from RLE(a[i]+a[j])

    This is a simple modmul/modpow pass in O(n') with the biggest constant time but low complexity, so it is still very fast.

The C++ code for this:

DWORD modpi(DWORD *a,int n,DWORD p) // T(c0.n+c1.n'.n'+c2.n'') full RLE(a[i]->a[i]+a[j]) n' <= n , n'' <= 2n , c0 <<< c1 << c2
    {
    int i,j,k;
    DWORD x,y;
    DWORD nn=(n+1)*2;
    DWORD *rle_iv =new DWORD[ n+1]; // RLE a[i] value
    int   *rle_in =new int[ n+1];   // RLE a[i] count
    int   *rle_ij=new int[nn+1];    // RLE (a[i]+a[j]) count
    // O(n) bucket sort a[] -> rle_i[] because max(a[i])<=n
    for (i=0;i<=n;i++) rle_in[i]=0;
    for (i=0;i<n;i++)  rle_in[a[i]]++;
    for (x=0,i=0;x<=n;x++)
     if (rle_in[x])
        {
        rle_iv[i]=       x;
        rle_in[i]=rle_in[x];
        i++;
        } n=i;
    // O(n'.n') convert rle_iv[]/in[] to rle_ij[]
    for (i=0;i<=nn;i++) rle_ij[i]=0;
    for (i=0;i<n;i++)
        {
        rle_ij[rle_iv[i]+rle_iv[i]]+=(rle_in[i]*(rle_in[i]-1))>>1; // 1+2+3+...+(rle_in[i]-1) pairs among equal values
        for (j=i+1;j<n;j++)
         rle_ij[rle_iv[i]+rle_iv[j]]+=rle_in[i]*rle_in[j];
        }
    // O(n') compute the whole PI to x
    for (x=1,y=0;y<=nn;y++)
     if (rle_ij[y])
      x=modmul(x,modpow(y,rle_ij[y],p),p);
    delete[] rle_iv;
    delete[] rle_in;
    delete[] rle_ij;
    return x;
    }

And comparison measurements:

n=10000
[ 751.606 ms] 814157062 O(n^2) naive
[ 515.944 ms] 814157062 O(n'.n) partial RLE(a[i]) n' <= n
[ 498.840 ms] 814157062 O(n'.n') full RLE(a[i]) n' <= n
[ 179.896 ms] 814157062 T(c0.n^2+c1.n') full RLE(a[i]+a[j]) n' <= 2n , c0 <<< c1
[  66.695 ms] 814157062 T(c0.n+c1.n'.n'+c2.n'') full RLE(a[i]->a[i]+a[j]) n' <= n , n'' <= 2n , c0 <<< c1 << c2
n=20000
[ 785.177 ms] 476588184 T(c0.n^2+c1.n') full RLE(a[i]+a[j]) n' <= 2n , c0 <<< c1
[ 255.503 ms] 476588184 T(c0.n+c1.n'.n'+c2.n'') full RLE(a[i]->a[i]+a[j]) n' <= n , n'' <= 2n , c0 <<< c1 << c2
n=100000
[6158.516 ms] 780587335 T(c0.n+c1.n'.n'+c2.n'') full RLE(a[i]->a[i]+a[j]) n' <= n , n'' <= 2n , c0 <<< c1 << c2

The last times are for this method. Doubling n multiplies the runtime by circa 4, so for n=200000 the runtime would be around 24 sec on my setup.

[Edit4] my NTT approach for comparison

I know you want to avoid FFT, but I still think this is good for comparison. A 32-bit NTT is OK for this, because it is applied only to the histogram, which consists only of exponents that are just a few bits wide and mostly equal to 1, preventing overflow even at n=200000. Here is the C++ source:

DWORD modpi(DWORD *a,int n,int m,DWORD p) // O(n.log(n) RLE(a[i])+NTT convolution
    {
    int i,z;
    DWORD x,y;
    for (i=1;i<=m;i<<=1); m=i<<1;   // m power of 2 > 2*(n+1)
    #ifdef _static_arrays
    m=2*M;
    DWORD rle[2*M];                 // RLE a[i]
    DWORD con[2*M];                 // convolution c[i]
    DWORD tmp[2*M];                 // temp
    #else
    DWORD *rle =new DWORD[m];       // RLE a[i]
    DWORD *con =new DWORD[m];       // convolution c[i]
    DWORD *tmp =new DWORD[m];       // temp
    #endif
    fourier_NTT ntt;
    // O(n) bucket sort a[] -> rle[] because max(a[i])<=n
    for (i=0;i<m;i++) rle[i]=0;
    for (i=0;i<n;i++) rle[a[i]]++;

    // O(m.log(m)) NTT convolution
    for (i=0;i<m;i++) con[i]=rle[i];
    ntt.NTT(tmp,con,m);
    for (i=0;i<m;i++) tmp[i]=ntt.modmul(tmp[i],tmp[i]);
    ntt.iNTT(con,tmp,m);
    // O(n') compute the whole PI to x
    for (x=1,i=0;i<m;i++)
        {
        z=con[i];                               // count of ordered index pairs whose sum equals i
        if (int(i&1)==0) z-=int(rle[(i+1)>>1]); // remove diagonal terms a+a for even sums
        z>>=1; y=i;                             // halve: ordered pairs -> i<j
        if ((y)&&(z)) x=modmul(x,modpow(y,z,p),p);
        }
    #ifdef _static_arrays
    #else
    delete[] rle;
    delete[] con;
    delete[] tmp;
    #endif
    return x;
    }

You can ignore the _static_arrays branch (treat it as not defined); it is just for simpler debugging. Beware: the convolution's ntt.modmul does not work with the task's p but with the NTT's modulus instead!!! If you want to be absolutely sure this would work for higher n or different data distributions, use a 64-bit NTT.

Here is a comparison to the Edit3 approach:

n=200000
[24527.645 ms] 863132560 O(m^2) RLE(a[i]) -> RLE(a[i]+a[j]) m <= n
[  754.409 ms] 863132560 O(m.log(m)) RLE(a[i])+NTT

As you can see I was not too far away from the estimated ~24 sec :).

Here are some timings comparing additional fast convolution methods I tried, using Karatsuba and FastSQR from Fast bignum square computation, to avoid FFT/NTT use:

n=10000
[ 749.033 ms] 149252794 O(n^2)        naive
[1077.618 ms] 149252794 O(n'^2)       RLE(a[i])+fast_sqr32
[ 568.510 ms] 149252794 O(n'^1.585)   RLE(a[i])+Karatsuba32
[  65.805 ms] 149252794 O(n'^2)       RLE(a[i]) -> RLE(a[i]+a[j])
[  53.833 ms] 149252794 O(n'.log(n')) RLE(a[i])+FFT
[  34.129 ms] 149252794 O(n'.log(n')) RLE(a[i])+NTT
n=20000
[3084.546 ms] 365847531 O(n^2)        naive
[4311.491 ms] 365847531 O(n'^2)       RLE(a[i])+fast_sqr32
[1672.769 ms] 365847531 O(n'^1.585)   RLE(a[i])+Karatsuba32
[ 238.725 ms] 365847531 O(n'^2)       RLE(a[i]) -> RLE(a[i]+a[j])
[ 115.047 ms] 365847531 O(n'.log(n')) RLE(a[i])+FFT
[  71.587 ms] 365847531 O(n'.log(n')) RLE(a[i])+NTT
n=40000
[12592.250 ms] 347013745 O(n^2)        naive
[17135.248 ms] 347013745 O(n'^2)       RLE(a[i])+fast_sqr32
[5172.836 ms] 347013745 O(n'^1.585)   RLE(a[i])+Karatsuba32
[ 951.256 ms] 347013745 O(n'^2)       RLE(a[i]) -> RLE(a[i]+a[j])
[ 242.918 ms] 347013745 O(n'.log(n')) RLE(a[i])+FFT
[ 152.553 ms] 347013745 O(n'.log(n')) RLE(a[i])+NTT

Sadly, the overhead of Karatsuba is too big, so its break-even threshold is above n=200000, making it useless for this task.

Spektre

Since ai <= 200.000 and N<=200.000, there might be 40.000.000.000 terms in total, but you know that ai + aj <= 400.000. There can be at most 400.000 unique terms. That's already 5 orders of magnitude better.

However, most of these terms aren't primes; there are only ~40.000 primes under 400.000. You may end up with a somewhat higher multiplicity for each individual term, but that's not a big deal. Calculating (prime^N) modulo 1000000007 is fast enough even for big exponents.

You can reasonably pre-calculate the factorization of all numbers <=400.000 and get the primes <=400.000 as a free side-effect.

This method achieves a speed-up because we delay multiplication, and instead count small prime factors found through a lookup. By the time we need to do the multiplications, we have a series of exponents and can use repeated squaring to efficiently reduce them.

It's counter-intuitive perhaps that we use prime factorization as a speed-up, when the "well-known fact" is that prime factorization is hard. But this is possible because each term is small, and we repeatedly need the same factorization.
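
A sketch of that pre-calculation, as a smallest-prime-factor sieve (the name `build_spf` and the bound are illustrative):

#include <vector>

// smallest-prime-factor sieve: spf[x] = smallest prime dividing x, so any
// x <= maxv factors by repeated division by spf[x]; the primes <= maxv are
// exactly the x with spf[x] == x (the free side-effect mentioned above)
std::vector<int> build_spf(int maxv)
{
    std::vector<int> spf(maxv + 1, 0);
    for (int i = 2; i <= maxv; i++)
        if (spf[i] == 0)                       // i is prime
            for (int j = i; j <= maxv; j += i)
                if (spf[j] == 0) spf[j] = i;
    return spf;
}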

[edit] From the comments, it seems that figuring out the multiplicity of ai+aj is hard, since you can only count the terms where i<j. But that's a non-issue. Count the multiplicity of all terms ai+aj, and divide by two, since aj+ai==ai+aj. This is only wrong for the diagonal where i==j. That is fixed by adding the multiplicity of all terms ai+ai prior to dividing by 2.

Ex: a={1 2 3}, terms to consider are {1+1, 1+2, 1+3, 2+2, 2+3, 3+3} [triangle]. The multiplicity of 4 is 2 (via 1+3 and 2+2). Instead, consider {1+1, 1+2, 1+3, 2+1, 2+2, 2+3, 3+1, 3+2, 3+3} [square] + {1+1, 2+2, 3+3} [diagonal]. The multiplicity of 4 is now 4 (1+3,2+2,3+1 and 2+2), divide by 2 to get the correct result.

And since the order of a[] no longer matters for the square variant, you can use a counting sort on it. E.g. given {4,5,6,5}, we get 4:1, 5:2, 6:1. Thus the multiplicity of 10 is 4+6:1, 5+5:2, 6+4:1
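
Putting the pieces together, here is a sketch of the whole method under the stated assumptions (the name `product_of_pair_sums` is illustrative; it uses `build_spf` from the sketch above): counting sort, then sum multiplicities via the square+diagonal trick over the distinct values only, then prime exponents accumulated by spf lookup and reduced with repeated squaring. Note that the multiplicity-counting step is still quadratic in the number of distinct values, which the comments below identify as the sticking point, and that the code follows this answer's i<=j convention (for the question's corrected i<j condition, subtract the diagonal counts instead of adding them):

#include <vector>
typedef long long ll;

static ll modpow(ll b, ll e, ll p)             // repeated squaring
{
    ll r = 1;
    for (b %= p; e > 0; e >>= 1, b = b * b % p)
        if (e & 1) r = r * b % p;
    return r;
}

ll product_of_pair_sums(const std::vector<int> &a, ll p)
{
    const int MAXA = 200000;                   // a[i] <= 200.000 per the question
    std::vector<int> spf = build_spf(2 * MAXA);
    std::vector<ll> cnt(MAXA + 1, 0);
    for (int v : a) cnt[v]++;                  // counting sort
    std::vector<int> vals;                     // distinct values only
    for (int v = 1; v <= MAXA; v++)
        if (cnt[v]) vals.push_back(v);
    // multiplicity of each sum = (square + diagonal) / 2, per the [edit] above
    std::vector<ll> mult(2 * MAXA + 1, 0);
    for (size_t i = 0; i < vals.size(); i++)
    {
        mult[2 * vals[i]] += cnt[vals[i]] * cnt[vals[i]] + cnt[vals[i]];
        for (size_t j = i + 1; j < vals.size(); j++)   // quadratic in distinct values!
            mult[vals[i] + vals[j]] += 2 * cnt[vals[i]] * cnt[vals[j]];
    }
    // accumulate exponents per prime by spf lookup, then one modpow per prime
    std::vector<ll> pexp(2 * MAXA + 1, 0);
    for (int s = 2; s <= 2 * MAXA; s++)
    {
        ll m = mult[s] / 2;                    // i<=j multiplicity of sum s
        if (m) for (int x = s; x > 1; x /= spf[x]) pexp[spf[x]] += m;
    }
    ll X = 1;
    for (int q = 2; q <= 2 * MAXA; q++)
        if (pexp[q]) X = X * modpow(q, pexp[q], p) % p;
    return X;
}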

MSalters
  • I was thinking similarly, but there is a problem: it is a triangle instead of a square, so there are 20.000.000.000 terms and no algebraic duplicates. Only numeric duplicates will be present, meaning you need to go over all `(n^2)/2` terms, and even a dummy nested for loop that does nothing takes too much time (`n=100000` took almost 3 seconds on my setup), so if you add any search, computations or table creation it would be much, much worse... That is why I discarded this approach from the start :( but still a nice idea, so +1 – Spektre Jan 10 '17 at 10:25
  • @Spektre: Don't get hung up on the triangle. It's O(N) to find the terms (ai+ai), which form the diagonal, and the multiplicity of the off-diagonal terms ai+aj is half of that in the square, since ai+aj==aj+ai. – MSalters Jan 10 '17 at 10:37
  • I guess I don't get the prime/factorisation angle. But multiplying the results of 400.000 fast exponentiations sounded far less intimidating than 40E12 multiplications - _if_ there was a _fast_ way to get the multiplicities - you beat me in sketching one without mentioning counting sort. – greybeard Jan 10 '17 at 10:38
  • @greybeard Then maybe I am missing something, as my measurement of an empty `O(n^2)` for loop is already in the timeout range without any functionality like reading from an array. And I think this approach needs to go over all terms, or am I missing something? If the array was sorted then I can see this being the way, but other than that... – Spektre Jan 10 '17 at 10:42
  • @greybeard: It's even better, you just have the 40.000 primes to contend with. Factorization by lookup is O(1). – MSalters Jan 10 '17 at 10:43
  • How do the primes help me? – greybeard Jan 10 '17 at 10:44
  • @greybeard I am not writing about the computation itself, but you first need to evaluate what to compute, and that part worries me – Spektre Jan 10 '17 at 10:44
  • `From the comments, it seems that figuring out the multiplicity of ai+aj` _in o(n²) time_ (_small o_(micron): notably faster). I thought I had it figured out, but what I had in mind would be _O(n) for any fixed value of ai+aj_ - alas, _O(n²)_ for all/every possible one: please spell out in detail how to establish the multiplicities. – greybeard Jan 10 '17 at 12:02
  • @greybeard I am not doing prime decomposition of `ai+aj`, just counting occurrences of `a[i]==a[j]`, which is `O(n)` + `O(n.log(n))` for the sort (not my bubble implementation, of course), so the RLE is done in `O(n.log(n))`. ... but that comment was not meant for me, I guess. – Spektre Jan 10 '17 at 12:33
  • `Count the multiplicity of all terms ai+aj` by generating each ai,aj is _O(n²)_. I can sort _A_ much faster than that, allowing to establish a _single_ multiplicity in _O(n)_, which is optimal. I can account for "the diagonal" in _O(n)_. I can generate _all_ "sum" multiplicities in, say, ascending order. I fail to see how to do so much faster than _O(n²)_. – greybeard Jan 10 '17 at 13:41
  • @greybeard The answers got swapped and I realized that after posting the comment. Anyway, the RLE in my answer is just a sort of decomposition which tells you the count of each value... my example exploits this only partially. I will try the full exploit (really curious about it), but right now I have more pressing matters to code (need to recode a filter for 3D surface edge detection so I can test it tomorrow on machines, as I just got approved time to test new stuff on them; estimated 1-3 hours of work :) ) – Spektre Jan 10 '17 at 13:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/132806/discussion-between-greybeard-and-spektre). – greybeard Jan 10 '17 at 13:57
  • from a [comment to Spektre's answer](http://stackoverflow.com/questions/41531391/alternate-way-to-compute-product-of-pairwise-sums-mod-1097-faster-than-on2/41566161#comment70344231_41566161): `from [the multiplicities of the values in A] you derive the frequency of all sums (which roughly is a convolution)` - please elaborate on this _in your answer_. – greybeard Jan 10 '17 at 14:23