What is the most performant method to calculate the sum of lcm(i, j), i and j ranging from 1 to k?

Question

THE PROBLEM :

MY SOLUTION :

The solution works, but unfortunately it is really too slow when k is large (k can be a number as big as 69533550916004 for example)

var k = 14;
var res = 0;
for(let i = 1; i <= k; i++){
  for(let j = 1; j <= k; j++){
    res += lcm(i, j);
  }
}
console.log(res)

function lcm(a,b){
  return (a * b) / gcd(a,b);
}

function gcd(a,b){
  if(!b){
    return a; 
  }  
  return gcd(b, a % b); 
}

Thanks

One little thing: "falsy" evaluations like `if(!b)` take more work than a strict boolean expression like `b===0`. Since it looks like `b` is always a number you can probably gain some performance that way. — David784, Jan 23 '21 at 22:30
Does this answer your question? [Javascript least common multiple algorithm](https://stackoverflow.com/questions/41863426/javascript-least-common-multiple-algorithm) — Rojo, Jan 23 '21 at 22:33
`lcm(i, j)` is the same as `lcm(j, i)` so you're doing too much repeated work. — Blastfurnace, Jan 23 '21 at 22:34
Are you serious about `k` being 69533550916004? One can't even find all prime numbers up to that threshold with ordinary PC. And doing your problem is certainly harder than that. — user3707125, Jan 23 '21 at 23:45
Something to do with Catalan numbers? (https://oeis.org/search?q=69533550916004&fmt=short) — Dave, Jan 24 '21 at 01:13

David Eisenstat · Accepted Answer · 2021-01-24T17:11:14.193

That's a large value of k -- quadratic time won't do. Here's an O(k log log k)-time algorithm, the same as the Sieve of Eratosthenes.

First, some number theory. We know that lcm(i, j) = (i*j) / gcd(i, j). If we were trying to evaluate

 k   k
sum sum (i*j),
i=1 j=1

then we could rewrite using the distributive law:

 k   k            k    2
sum sum (i*j) = (sum i)  = ((k+1)*k / 2)^2.
i=1 j=1          i=1

We can try to salvage the idea by summing over possible greatest common divisors g:

 k   k                k         [k/g]      2
sum sum lcm(i, j) =? sum (1/g)*( sum  (g*i) ),
i=1 j=1              g=1         i=1

but of course this is wrong, since we end up with (e.g.) lcm(2, 2) being represented twice, once with g=1, once with g=2. To straighten it out, we turn to Möbius inversion, which gives us the correct formula

 k   k               k                [k/g]      2
sum sum lcm(i, j) = sum f(g)*(1/g)*( sum  (g*i) ),
i=1 j=1             g=1                i=1

                     k                              2
                  = sum g*f(g)*(([k/g]+1)*[k/g] / 2) ,
                    g=1

where the definition of f(g) is

f(g) =  product  (1-p).
       prime p|g

We can evaluate f in bulk by modifying the Sieve of Eratosthenes. Python 3 below:

def fast_method(k):
    f = [1] * (k + 1)
    for i in range(2, k + 1):
        if f[i] != 1:
            continue
        for j in range(i, k + 1, i):
            f[j] *= 1 - i
    return sum(g * f[g] * ((k // g + 1) * (k // g) // 2) ** 2 for g in range(1, k + 1))


import math


def slow_method(k):
    return sum(math.lcm(i, j) for i in range(1, k + 1) for j in range(k + 1))


def test():
    for k in range(100):
        assert fast_method(k) == slow_method(k), k


test()

(If you're serious about that value of `k`, you might also have to segment the sieve using standard techniques.) — David Eisenstat, Jan 24 '21 at 14:50
wanted to try your approach but cant get pass your math notation and python code is foreign to me... 1. so in the math equation above the last sum you got `[k/g]` assuming sum bound its `ceil` or `floor` of `k/g` or something else? 2. the `f(g) = product (1-p). prime p|g` means PI of all primes that decompose `g` ? how about powers of primes? for example `f(12) = 2*2*3` ? — Spektre, Jan 25 '21 at 11:09
@Spektre `[k/g] = floor(k/g)`. The symbol `|` means "divides", so `f(12) = (1 - 2) (1 - 3) = 2`. — David Eisenstat, Jan 25 '21 at 13:32
thx ... so powers of primes are not accounted and it really is `1-p(i)` ... I did not expect negative sub-results so I taught I was seeing something else that it should be ... will give it a shot and update my answer with code if successful. — Spektre, Jan 25 '21 at 13:38
I managed to get it working along with my 2D SoE based prime decomposition and unsigned arithmetics. It should be faster than yours prime decomposition however only for k for which the LUT fits into memory so for big `k` 1D LUT,SoE or entirely different prime decomposition must be used anyway. Thanks again ... — Spektre, Jan 25 '21 at 14:07

Spektre · Answer 2 · 2021-01-25T14:10:32.853

First simple optimizations:

As mentioned lcm(i,j) is the same as lcm(j,i)

so you compute almost all therms of the sum twice. Instead you can compute the repeating therms just once and then double the sum result.
remaining non repeating therms lcm(i,j) where i==j

The lcm(i,i)=i so they form a series 1+2+3+...+k which is computable with equation: k*(k+1)/2
prime decomposition 2D LUT using SoE

Both previous optimizations combined are slightly more than 2 times faster (however the complexity is still quadratic so this is still not fast enough for bigger numbers !!!

For more advanced and faster approach you need enough memory so you could do SoE (Sieve of Eratosthenes) based prime decomposition to speed up the GCD and LCM operations considerably (better complexity on bignums and overally smaller constant time). The idea is to have k lists holding the prime decomposition of each number up to k.

Then to obtain GCD you just compare the 2 lists and multiply together all primes (with their powers) present in both lists.

The LCM is similar but you multiply almost all primes you just do use those which are present in both lists only once. This can be enhanced further by not using prime decomposition of one number in list and multiply by the number instead.

The search of both list can be done in O(m) as the SoE will create the lists already sorted where m is the number of primes that decompose i or j ... which is much much less then their actual value.

In case you want to pre-allocate the individual i-th list you can take number of primes up to sqrt(i) as upper bound. Which can be computed with ~10% error by:
```
sqrt(i)/ln(sqrt(i));
```
So I used this just to be sure I have enough:
```
ceil(1.2*sqrt(i)/ln(sqrt(i)))+1;
```
On integers you could use log2 (or number of bits) instead + some big enough margin.

However 2D LUT like this will be big for big k !!! Which will not fit into memory of standard computer so this is a no go for bignum k.

This approach will give you much bigger boost (measurement show ~10x times faster runtime against the #1+#2 optimizations on small k and gets better with increasing k). This approach affects complexity (its better) on bigints but its dependent on the bigint implementation. My estimate its slightly below quadratic on mine bigints but had not tried k nowhere near thresholds for multiplication as this is still too slow for bigints.
primes up to k 1D LUT using SoE

this will speed up the GCD and LCM operation but not as fast as the previous LUT. On the bright side this requires much less memory. However your current limit would still need ~15.8 TByte of storage which might be possible in future. However this will be still very slow...
using PSLQ

This might be the most promising way to achieve this task fast. Do a PSLQ analysis on the series formed by using
```
k={ 1,2,3,... }
A(k) = sum_of_lcm_up_to(k)
A = 1,7,28,72,177,303,604,948,1497,...
```
which might return you more direct equation.

IIRC BPP formula for computing Pi was found this way. Until then it was not known that we can compute individual digits of Pi without needing the previous ones nor that it can be computed without big numbers arithmetics...

However I never perform nor fully understood the PSLQ as its far from my field of expertise my naive understanding is that is something like PCA done on individual digits of series of numbers trying to find numeric relation between iteration of series

Here C++ code that combines #1,#2,#3 compilable with bigints and without too (if not using any 3th party libs) you just rewrite the bigint typedef with whatever you want to use. But beware on 32bit unsigned integers (without bigints) this will overflow with small numbers so the limit is only k<=390 !!!

//---------------------------------------------------------------------------
#include <math.h>
//---------------------------------------------------------------------------
typedef unsigned int bigint; // rewrite int to whatewer bigint datatype you got
//---------------------------------------------------------------------------
// prime decomposition LUT
int pdec_n=0,pdec_siz=0,*pdec_num=NULL,**pdec_LUT=NULL;;
void pdec_exit()    // free memory
    {
    if (pdec_num) delete[] pdec_num;
    if (pdec_LUT)
        {
        if (pdec_LUT[0]) delete[] pdec_LUT[0];
        delete[] pdec_LUT;
        }
    pdec_n=0; pdec_siz=0; pdec_num=NULL; pdec_LUT=NULL;
    }
void pdec_init(int n)                       // pdec_LUT[i][j] holds prime decomposition of i where j={ 0,1,2...pd_num[i]-1 }
    {
    double x;
    int i,j,a;
    pdec_exit();
    pdec_n=n;
    pdec_num=new int[n+1];              // primes in decomposition
    pdec_LUT=new int*[n+1];             // decompositions
    // compute decompositions size
    pdec_num[0]=0;
    pdec_num[1]=0;
    for (j=0,i=2;i<=n;i++)
        {
        x=i;
        x=sqrt(x);                      // primes up to sqrt(x)
        x=x*log(2.7182818284590452353602874713527)/log(x);
        x=ceil(1.2*x)+1;                // +20% and round up +1 as a margin for error
        a=x;
        pdec_num[i]=a;
        j+=a;
        }
    pdec_siz=j+10;
    // compute starts of each decomposition
    pdec_LUT[0]=new int[j];             // allocate buffer for all decompositions once
    pdec_LUT[1]=pdec_LUT[0];
    for (j=1,i=2;i<=n;j=i,i++)
        {
        pdec_LUT[i]=pdec_LUT[j]+pdec_num[j];
        pdec_num[j]=0;
        }
    pdec_num[n]=0;
    // SoE prime decomposition
    for (i=2;i<=n;i++)
     if (pdec_num[i]==0)                // is prime?
      for (j=i;j<=n;j+=i)               // add it to all its multiples
       for (a=j;a%i==0;a/=i){ pdec_LUT[j][pdec_num[j]]=i; pdec_num[j]++; } // add also powers
    }
//---------------------------------------------------------------------------
bigint pdec_gcd(int a,int b)            // fast GCD using pdec_LUT
    {
    bigint _gcd;
    int i,j,ni,nj,*pi,*pj;
    pi=pdec_LUT[a]; ni=pdec_num[a];
    pj=pdec_LUT[b]; nj=pdec_num[b];
    for (_gcd=1,i=0,j=0;(i<ni)&&(j<nj);)
        {
        a=pi[i];
        b=pj[j];
        if (a==b){ _gcd*=a; i++; j++; continue; }
        if (a<b) i++; else j++;
        }
    return _gcd;
    }
//---------------------------------------------------------------------------
bigint pdec_lcm(int a,int b)            // fast LCM using pdec_LUT
    {
    bigint _lcm;
    int i,j,ni,nj,*pi,*pj;
    if (pdec_num[a]<pdec_num[b]){ i=a; a=b; b=i; }  // b has less multiplications
    pi=pdec_LUT[a]; ni=pdec_num[a];
    pj=pdec_LUT[b]; nj=pdec_num[b];
    for (_lcm=a,i=0,j=0;(i<ni)&&(j<nj);)    // a is fully multiplied
        {
        a=pi[i];
        b=pj[j];
        if (a==b){ i++; j++; continue; }
        if (a< b){ i++; }
        if (a> b){ _lcm*=b; j++; }          // use primes from b only if not present in a
        }
    for (;j<nj;j++) _lcm*=pj[j];
    return _lcm;
    }
//---------------------------------------------------------------------------
int gcd(int a,int b)                        // slow euclid GCD
    {
    if(!b) return a;
    return gcd(b, a % b);
    }
//---------------------------------------------------------------------------
bigint lcm(int a,int b)                     // slow LCM
    {
    bigint _lcm;
    _lcm=a;
    _lcm*=b;
    _lcm/=gcd(a,b);
    return _lcm;
    }
//---------------------------------------------------------------------------
bigint sum_of_lcm0(int k)
    {
    int i,j;
    bigint s;
    // sum of lcm(i,j) up to k
    for (s=0,i=1;i<=k;i++)
     for (j=i+1;j<=k;j++)           // do not compute the same terms twice
      s+=lcm(i,j);
    s+=s;                           // double the repeating therms
    s+=k*(k+1)>>1;                  // add sum of series: 1+2+3+...+k
    return s;
    }
//---------------------------------------------------------------------------
bigint sum_of_lcm1(int k)
    {
    int i,j;
    bigint s;
    // sum of lcm(i,j) up to k
    for (s=0,i=1;i<=k;i++)
     for (j=i+1;j<=k;j++)
      s+=pdec_lcm(i,j);
    s+=s;                           // double the repeating therms
    s+=k*(k+1)>>1;                  // add sum of series: 1+2+3+...+k
    return s;
    }
//---------------------------------------------------------------------------

Usage:

int k;
bigint s;
k=390; // cant get above this on 32bit uints
pdec_init(k);
s=sum_of_lcm0(k); // #1+#2    k limited by bigint
s=sum_of_lcm1(k); // #1+#2+#3 k limited by bigint and pdec_init
pdec_exit();

And here measurement for the faster approach using pdec_init(1000) and 64bit bigints:

[   0.507 ms] pdec init 38 KBytes
[   1.222 ms] k= 100 sum=    18446224
[   4.613 ms] k= 200 sum=   295280840
[  10.734 ms] k= 300 sum=  1485491616
[  21.983 ms] k= 400 sum=  4692090228
[  38.277 ms] k= 500 sum= 11454503268
[  52.168 ms] k= 600 sum= 23723529128
[  67.608 ms] k= 700 sum= 43946379280
[  83.128 ms] k= 800 sum= 74987181444
[ 106.942 ms] k= 900 sum=120019988208
[ 132.031 ms] k=1000 sum=183011304660

Here David Eisenstat's approach using the 2D prime decomposition LUT using unsigned math and ported into C++

bigint sum_of_lcm2(int k)
    {
    int g,*p,n,a,a0,i,sign;
    bigint s,f;
    // sum of lcm(i,j) up to k
    for (s=0,g=1;g<=k;g++)
        {
        f=k/g;
        f=((f+1)*f)>>1;
        f*=f;
        f*=g;
        p=pdec_LUT[g];
        n=pdec_num[g];
        for (sign=0,a=0,i=0;i<n;i++)
            {
            a0=a; a=p[i];
            if (a0==a) continue; // ignore powers
            f*=(a-1);
            sign=!sign;
            }
        if (sign) s-=f; else s+=f;
        }
    return s;
    }

and measurements:

[  62.085 ms] pdec init 18888 KBytes
[   0.685 ms] k=  1000 sum=            183011304660
[   1.373 ms] k=  2000 sum=           2926354006228
[   2.072 ms] k=  3000 sum=          14805042797184
[   2.781 ms] k=  4000 sum=          46781822900688
[   3.539 ms] k=  5000 sum=         114218125116052
[   4.756 ms] k=  6000 sum=         236808736848248
[   5.223 ms] k=  7000 sum=         438721136165016
[   6.806 ms] k=  8000 sum=         748454504589824
[   7.645 ms] k=  9000 sum=        1198740116138576
[   8.424 ms] k= 10000 sum=        1827127167830060
[   9.303 ms] k= 11000 sum=        2675130127257140
[  10.292 ms] k= 12000 sum=        3788610508835184
[  10.220 ms] k= 13000 sum=        5218119174379260
[  10.736 ms] k= 14000 sum=        7019284675919704
[  10.782 ms] k= 15000 sum=        9249622447024312
[  13.937 ms] k= 16000 sum=       11973761092807672
[  14.610 ms] k= 17000 sum=       15260078673441888
[  14.535 ms] k= 18000 sum=       19179420697925096
[  15.512 ms] k= 19000 sum=       23809411309067844
[  15.049 ms] k= 20000 sum=       29232792766430776
[  15.454 ms] k= 21000 sum=       35530729593528460
[  19.366 ms] k= 22000 sum=       42798909693505916
[  19.164 ms] k= 23000 sum=       51128789507207824
[  17.958 ms] k= 24000 sum=       60615343952225808
[  18.378 ms] k= 25000 sum=       71367596207492300
[  22.220 ms] k= 26000 sum=       83490183930172416
[  22.565 ms] k= 27000 sum=       97093514169469316
[  21.821 ms] k= 28000 sum=      112296054697946652
[  24.109 ms] k= 29000 sum=      129222336308688284
[  24.484 ms] k= 30000 sum=      147985909450776556
[  23.518 ms] k= 31000 sum=      168726314083539724
[  24.437 ms] k= 32000 sum=      191573248859171168
[  25.810 ms] k= 33000 sum=      216665462512674040
[  26.562 ms] k= 34000 sum=      244144733826449780
[  25.961 ms] k= 35000 sum=      274165636445073084
[  26.890 ms] k= 36000 sum=      306861019309280956
[  27.347 ms] k= 37000 sum=      342403256303560068
[  28.250 ms] k= 38000 sum=      380951178712974600
[  28.792 ms] k= 39000 sum=      422654544640465536
[  32.039 ms] k= 40000 sum=      467702202175553120
[  31.689 ms] k= 41000 sum=      516261770711667188
[  32.103 ms] k= 42000 sum=      568493102008765720
[  33.605 ms] k= 43000 sum=      624601491354014924
[  33.251 ms] k= 44000 sum=      684766511634347904
[  38.647 ms] k= 45000 sum=      749168090843424836
[  34.710 ms] k= 46000 sum=      818010181586572760
[  36.381 ms] k= 47000 sum=      891505184775229460
[  35.670 ms] k= 48000 sum=      969815208006099344
[  36.745 ms] k= 49000 sum=     1053194632366715464
[  39.265 ms] k= 50000 sum=     1141860626298697932
[  41.430 ms] k= 51000 sum=     1235961937510193376
[  39.108 ms] k= 52000 sum=     1335793856714706980
[  39.395 ms] k= 53000 sum=     1441568696460268148
[  41.918 ms] k= 54000 sum=     1553459252290960280
[  40.703 ms] k= 55000 sum=     1671762162244218552
[  41.705 ms] k= 56000 sum=     1796724219557144384
[  43.305 ms] k= 57000 sum=     1928541735647080528
[  49.717 ms] k= 58000 sum=     2067474716400847492
[  43.599 ms] k= 59000 sum=     2213789303836196316
[  48.271 ms] k= 60000 sum=     2367722439679998864
[  45.474 ms] k= 61000 sum=     2529554127958058476
[  47.711 ms] k= 62000 sum=     2699556160225194044
[  50.947 ms] k= 63000 sum=     2877966575602455352
[  53.505 ms] k= 64000 sum=     3065104550449828668
[  54.559 ms] k= 65000 sum=     3261217251285430196
[  53.066 ms] k= 66000 sum=     3466580213396622284
[  51.004 ms] k= 67000 sum=     3681481683988726580
[  52.176 ms] k= 68000 sum=     3906300655499965688
[  52.597 ms] k= 69000 sum=     4141140842108345608
[  54.037 ms] k= 70000 sum=     4386474829016834456
[  55.015 ms] k= 71000 sum=     4642634007103668512
[  56.036 ms] k= 72000 sum=     4909694650694395656
[  56.753 ms] k= 73000 sum=     5188180625623422844
[  56.944 ms] k= 74000 sum=     5478395966863778240
[  58.519 ms] k= 75000 sum=     5780503390808063112
[  58.851 ms] k= 76000 sum=     6095030347012208488
[  59.481 ms] k= 77000 sum=     6422238822380824088
[  59.821 ms] k= 78000 sum=     6762431127210826000
[  60.474 ms] k= 79000 sum=     7115946997843513388
[  62.483 ms] k= 80000 sum=     7483237599224918568
[  62.773 ms] k= 81000 sum=     7864354728203024788
[  63.597 ms] k= 82000 sum=     8259936085084243856
[  63.737 ms] k= 83000 sum=     8670356746307209408
[  65.268 ms] k= 84000 sum=     9095726603620478688
[  66.408 ms] k= 85000 sum=     9536717580904610040
[  66.726 ms] k= 86000 sum=     9993536125103095688
[  67.575 ms] k= 87000 sum=    10466432130003173312
[  68.633 ms] k= 88000 sum=    10956004885713459408
[  68.506 ms] k= 89000 sum=    11462647299842391920
[  69.760 ms] k= 90000 sum=    11986403043182154584
[  70.185 ms] k= 91000 sum=    12528192303586301056
[  71.192 ms] k= 92000 sum=    13088114114334229944
[  72.118 ms] k= 93000 sum=    13666442859056921956
[  72.514 ms] k= 94000 sum=    14263784093996812372
[  73.450 ms] k= 95000 sum=    14880480246958258764
[  75.154 ms] k= 96000 sum=    15516955556268238668
[  75.907 ms] k= 97000 sum=    16173675941972720104
[  76.416 ms] k= 98000 sum=    16851041998775635308
[  76.931 ms] k= 99000 sum=    17549420431062508128
[  77.511 ms] k=100000 sum=    18269345553999897648

It can be speeded up a bit more by removing the powers of primes from LUT. However for big k this will not fit into memory so in such case you need 1D prime LUT and do the decomposition on the run which is much slower but does not need as much memory. IIRC there where some fast methods of prime decomposition when list of primes up to sqrt(k) is known using FFT or NTT (unless I miss match it in memory with some different problem)

"This is much faster then previous method" - How much faster? — superb rain, Jan 24 '21 at 12:33
Thanks a lot @Spektre, really helped ! It's more than 2 times faster than my method ! — Kylo TheDev, Jan 24 '21 at 12:45
Also, I'm having trouble compiling this. How do I make `uint` and `List` compile? (I'm not a C++ guy) — superb rain, Jan 24 '21 at 12:50
@Spektre I'm even less of a javascript guy than a C++ guy :-). How do I make it compile as C++? Why would it be "sloooow" without pre-allocation, and *how* slow is "sloooow"? — superb rain, Jan 24 '21 at 13:33
@superbrain as any adding to the list might invoke reallocation which leads to copy of the whole list of lists in some cases (depend on implementation)... How much faster than first method? That is hard to say because its really dependent on the big int implementation used behind this (you know basic operations are not `O(1)` anymore. For small numbers its the difference between lcm(i,j) using naive gcd and or prime decomposition... The naive one is `O(log(i+j))` however on bigints not anymore as it uses `*,%` using prime decomposition LUT get rid of the division which is huge difference? — Spektre, Jan 24 '21 at 13:34
Well it's important to my compiler. It's rather annoying having to fix the code just to try it out. What's "LUT"? (And btw what's "SoE"?) — superb rain, Jan 24 '21 at 13:37
@superbrain see the edit1 and **SoE** is Sieve of Eratosthenes which is a must know algo for problems like this ... its the fastest known way of finding all primes up to some `n` and can be used also for fast prime decomposition and speeding up GCD and other algorithms ... More advanced/fast stuff also combines SoE and FFT/NTT The **LUT** means Look Up table which is a basic programming therm like BYTE, loop , condition .... and its usually represented by arrays in any dimensionality — Spektre, Jan 24 '21 at 14:01
Thanks. In the meantime I [made it work as well](https://repl.it/@sp0/sums-of-lcms#main.cpp). For k = 10000, your "much faster" method was about factor 1.28 faster (5.7 seconds vs 7.3 seconds). Maybe eventually, with much larger k, it would indeed be "much faster". Then again, with much larger k, *both* would be very slow. No chance to solve their example k = 69533550916004. — superb rain, Jan 24 '21 at 14:19
(And I know the sieve of Eratosthenes and lookup tables, but I don't think those acronyms are common (enough) to just use them and expect everybody to understand.) — superb rain, Jan 24 '21 at 14:22
Got [some more ideas](https://repl.it/@sp0/sum-of-lcms#main.py) but not enough time / interest (since the question is lacking and OP can't even be bothered to reply). Solution 2 might be similar to yours but perhaps faster, and solution 3 might actually be fast. — superb rain, Jan 24 '21 at 14:32
@KyloTheDev I reedited my question and the code so its the same for native `int` and bigints (its just matter of changeing the typedef at top) and added also fusion between my `#3` and David Eisenstat's approach. The `sum_of_lcm0` is ~2x faster, the `sum_of_lcm1` is ~20x faster than yours (for small k ... for bigger ones its even better) and `sum_of_lcm2` is in entirely different class of speeds as the complexity is much better. However leave the other answer as accepted as that is the correct answer. I just wanted to try it and post C++ code for those like me that do not read Python. — Spektre, Jan 25 '21 at 14:23

What is the most performant method to calculate the sum of lcm(i, j), i and j ranging from 1 to k?

2 Answers2