Running Count is Slow in Google Sheets

Question

Here's my way of calculating running count by groups in Sheets:

=LAMBDA(a,INDEX(if(a="",,COUNTIFS(a,a,row(a),"<="&row(a)))))(B4:B)

The complexity of this formula is R^2 = 1000000 operations for 1K rows. I'd love to make more efficient formula, and tried combinations of LABMDA and SCAN. For now I've found only the way to do it fast with 1 group at a time:

=INDEX(IF(B4:B=" Corn",SCAN(0,B4:B,LAMBDA(i,v,if(v=" Corn",i+1,i))),))

Can we do the same for all groups? Do you have an idea?

Note: the script solution would use object and hash to make it fast.

Legal Tests

We have a list of N items total with m groups. Group m(i) is a unique item which may repeat randomly. Samlpe dataset:

a
b
b
b
a

↑ Sample for 5 items total and 2 groups: N=5; m=2. Groups are "a" and "b"

The task is to find the function which will work faster for different numbers of N and m:

Case #1. 1000+ accurances of an item from a group m(i)
Case #2. 1000+ different groups m
General case sagnificant number of total items N ~ 50K+

Playground

Samlpe Google Sheet with 50K rows of data. Please click on the button 'Use Tamplate':

Test Sheet with 50K values

Speed Results

Tested solutions:

Countifs from the question and Countif and from answer.
Xlookup from answer
Complex Match logic from answer
Sorting logic from the answer

In my enviroment, the sorting option works faster than other provided solutions. Test results are here, tested with the code from here.

_find the function which will work faster for different numbers of `N` and `m`_ — it may be that the three cases you present would each require a solution of their own. The performance of most answers below depends a lot on the number of groups in data. — doubleunary, Nov 19 '22 at 14:48

Max Makhrov · Answer 1 · 2022-11-15T08:12:46.103

`Transpose` groups `m` = 5

I've found a possible way for a small amount of counted groups.

In my tests: 20K rows and 5 groups => cumulative count worked faster with this function:

INDEX(if(B4:B="",,LAMBDA(eq,BYROW(index(TRANSPOSE(SPLIT(TRANSPOSE(BYCOL(eq,LAMBDA(c,query("-"&SCAN(0,c,LAMBDA(i,v,i+v)),,2^99))))," -"))*eq),LAMBDA(r,sum(r))))(--(B4:B=TRANSPOSE(UNIQUE(B4:B))))))

It's ugly, but for now I cannot do a better version as bycol function does not produce arrays.

Apps Script

The perfect solution would be to have "hash"-like function in google-sheets:

/** runningCount
 * 
 * @param {Range} data
 * 
 * @CustomFunction
 * 
 */
function runningCount(data) {
  var obj = {};
  var l = data[0].length;
  var k;
  var res = [], row;
  for (var i = 0; i < data.length; i++) {
    row = []
    for (var ii = 0; ii < l; ii++) {
      k = '' + data[i][ii];
      if (k === '') {
        row.push('');
      } else {
        if (!(k in obj)) {
          obj[k] = 1;
        } else {
          obj[k]++;
        }
        row.push(obj[k]);
      }
    }
    res.push(row);
  }
  return res;
}

Max Makhrov · Accepted Answer · 2022-11-21T04:54:47.080

`Sorting` algorithm

The idea is to use SORT in order to reduce the complexity of the calculation. Sorting is the built-in functionality and it works faster than countifs.

Sort columns and their indexes
Find the place where each new element of a group starts
Create a counter of elements for sorted range
Sort the result back using indexes from step 1

Data is in range A2:A

1. Sort + Indexes

=SORT({A2:A,SEQUENCE(ROWS(A2:A))})

2. Group Starts

C2:C is a range with sorted groups

=MAP(SEQUENCE(ROWS(A2:A)),LAMBDA(v,if(v=1,0,if(INDEX(C2:C,v)<>INDEX(C2:C,v-1),1,0))))

3. Counters

Count the item of each group by the column of 0/1 values, 1 - where group starts:

=SCAN(0,F2:F,LAMBDA(ini,v,IF(v=1,1,ini+1)))

4. Sort the resulting countes back

=SORT(H2:H,D2:D,1)

The Final Solution

Suggested by Tom Sharpe:

cut out one stage of the calculation by omitting the map and going straight to a scan like this:

=LAMBDA(a,INDEX(if(a="",, LAMBDA(srt, SORT( SCAN(1,SEQUENCE(ROWS(a)), LAMBDA(ini,v,if(v=1,1,if(INDEX(srt,v,1)<>INDEX(srt,v-1,1),1,ini+1)))), index(srt,,2),1) ) (SORT({a,SEQUENCE(ROWS(a))})))))(A2:A)

↑ In my tests this solution is faster.

I pack it into the named function. Sample file with the solution: https://docs.google.com/spreadsheets/d/1OSnLuCh-duW4eWH3Y6eqrJM8nU1akmjXJsluFFEkw6M/edit#gid=0

this image explains the logic and the speed of sorting:

↑ read more about the speed test

Hi @Max that seems pretty optimal to me - everything linear except the sort which will be log-lin and is very fast. Only suggestion is you could cut out one stage of the calculation by omitting the map and going straight to a scan like this =LAMBDA(a,INDEX(if(a="",, LAMBDA(srt, SORT( SCAN(1,SEQUENCE(ROWS(a)), LAMBDA(ini,v,if(v=1,1,if(INDEX(srt,v,1)<>INDEX(srt,v-1,1),1,ini+1)))), index(srt,,2),1) ) (SORT({a,SEQUENCE(ROWS(a))})))))(A2:A) — Tom Sharpe, Nov 20 '22 at 10:23
Hi @tom-sharpe! Thanks, in my tests your way even a bit faster. Will change the answer! — Max Makhrov, Nov 21 '22 at 04:52
Here is an application of your method https://stackoverflow.com/questions/74711785/google-sheets-latest-date-of-each-summary-value-conditional-formatting/74723296#74723296 — Tom Sharpe, Dec 08 '22 at 08:22

score 2 · Answer 3 · answered Nov 11 '22 at 16:30

2

You can try this:

=QUERY(
  REDUCE(
    {"", 0},
    B4:B10000,
    LAMBDA(
      acc,
      cur,
      {
        acc;
        cur, XLOOKUP(
               cur,
               INDEX(acc, 0, 1),
               INDEX(acc, 0, 2),
               0,
               0,
               -1
             ) + 1
      }
    )
  ),
  "SELECT Col2 OFFSET 1",
  0
)

A bit better than R^2. Works fast enough on 10 000 rows. On 100 000 rows it works, but it is quite slow.

answered Nov 11 '22 at 16:30

kishkin

5,152
1
26
40

Hi @kishkin! Thank you very much for posting this and other answer. I think these solutions are genius. I comment this because I've started a bounty on this question and wanted to let you know about it. My tests are gathered in my answer. I've also made a new testing database here: https://docs.google.com/spreadsheets/d/1vQu7hVr7FwH8H5N8JOlOGfvjlJgKtpfoM2DPPjgUaLo/template/preview – Max Makhrov Nov 14 '22 at 09:08
1

Hi @kishkin! I've awarded this answer because it should be the most efficient way as it "repeats" HASH-way of working with data: find the first entry from the end by key. Pity, real tests shown this did not happen: my test files timed out with this formula. I suppose, `XLOOKUP` and other ways cannot be oprimized this way using Spreadsheets. – Max Makhrov Nov 21 '22 at 06:09

score 2 · Answer 4 · answered Nov 12 '22 at 13:40

Another approach. Works roughly 4 times faster than the first one.

=LAMBDA(
    shift,
    ref,
    big_ref,
    LAMBDA(
        base_ref,
        big_ref,
        ARRAYFORMULA(
            IF(
                A2:A = "",,
                    MATCH(VLOOKUP(A2:A, base_ref, 2,) + ROW(A2:A), big_ref,) - VLOOKUP(A2:A, base_ref, 3,)
            )
        )
    )
    (
        ARRAYFORMULA(
            {
                ref,
                SEQUENCE(ROWS(ref)) * shift,
                MATCH(SEQUENCE(ROWS(ref)) * shift, big_ref,)
            }
        ),
        big_ref
    )
)
(
    10 ^ INT(LOG10(ROWS(A:A)) + 1),
    UNIQUE(A2:A),
    SORT(
        {
            MATCH(A2:A, UNIQUE(A2:A),) * 10 ^ INT(LOG10(ROWS(A:A)) + 1) + ROW(A2:A);
            SEQUENCE(ROWS(UNIQUE(A2:A))) * 10 ^ INT(LOG10(ROWS(A:A)) + 1)
        }
    )
)

doubleunary · Answer 5 · 2022-11-21T14:18:24.447

Here's an implementation of kishkin's second approach that offloads much of the lookup table setup to lambdas early on. The changes in logic are not that big, but they seem to benefit the formula quite a bit:

5 uniques	5000 rows	4000 rows	3000 rows	2000 rows	1000 rows
lambda offload	14.87x	14.45x	10.04x	10.50x	7.05x
sort redux	7.73x	5.89x	4.89x	3.96x	2.24x
max makhrov sort	4.23x	4.52x	3.65x	3.31x	1.95x
array countifs	2.59x	2.66x	2.55x	2.56x	2.90x
kishkin2	0.83x	0.80x	0.81x	1.03x	1.19x
naïve countif	1.00x	1.00x	1.00x	1.00x	1.00x

I primarily tested using this benchmark and would welcome testing by others.

=arrayformula( 
  lambda( 
    groups, 
    lambda( 
      uniques, shiftingFactor, 
      lambda( 
        shiftedOrdinals, 
        lambda( 
          ordinalLookup, 
          lambda( 
            groupLookup, 
            iferror( 
              match( 
                vlookup(groups, groupLookup, 2, true) + row(groups), 
                ordinalLookup, 
                1 
              ) 
              - 
              vlookup(groups, groupLookup, 3, true) 
            ) 
          )( 
            sort( 
              { 
                uniques, 
                shiftedOrdinals, 
                match(shiftedOrdinals, ordinalLookup, 1) 
              } 
            ) 
          ) 
        )( 
          sort( 
            { 
              match(groups, uniques, 1) * shiftingFactor + row(groups); 
              shiftedOrdinals 
            } 
          ) 
        ) 
      )(sequence(rows(uniques)) * shiftingFactor) 
    )( 
      unique(groups), 
      10 ^ int(log10(rows(groups)) + 1)
    ) 
  )(A2:A) 
)

The formula performs best when the number of groups is small. Here are some benchmark results with a simple numeric 50k row corpus where the number of uniques differs:

50k rows	11 uniques	1000 uniques
lambda offload	14.41x	3.57x
array countifs	1.00x	1.00x

Performance degrades as the number of groups increases, and I even got a few incorrect results when the number of groups approached 20k.

Hy @doubleunary. Thank you for the affort, in my tests the formula works fast. But the results are wrong. Tests are here: https://docs.google.com/spreadsheets/d/1UGm5NcFkqpm340idCwKW2ZhEwtRr7mhv8Qg9QrGSK0g/edit#gid=0. Did I do it wrong? — Max Makhrov, Nov 21 '22 at 04:25
Your test seems good. It looks like the formula will get incorrect results when the first appearances of unique groups do not take place in Sheets lexicographical order. I may take another look at this later when I have time. — doubleunary, Nov 21 '22 at 14:16

score 1 · Answer 6 · edited Nov 10 '22 at 14:27

1

Mmm, it will probably be more efficient, but you'll have to try:

=Byrow(B4:B,lambda(each,if(each="","",countif(B4:each,each))))

or

=map(B4:B,lambda(each,if(each="","",countif(B4:each,each))))

Let me know!

edited Nov 10 '22 at 14:27

Max Makhrov

17,309
5
55
81

answered Nov 10 '22 at 13:33

Martín

7,849
2
3
13

Thanks, Martín! This works fine. And the formula is small. It is still slow due to the way it counts: each cell by each cell. – Max Makhrov Nov 10 '22 at 14:28
Yes, Martín. Some business tasks as caclulating discount depending on the number of service purchased reqire this – Max Makhrov Nov 10 '22 at 14:53

Running Count is Slow in Google Sheets

Legal Tests

Playground

Speed Results

6 Answers6

`Transpose` groups `m` = 5

Apps Script

`Sorting` algorithm

1. Sort + Indexes

2. Group Starts

3. Counters

4. Sort the resulting countes back

The Final Solution

Linked

Running Count is Slow in Google Sheets

Legal Tests

Playground

Speed Results

6 Answers6

Transpose groups m = 5

Apps Script

Sorting algorithm

1. Sort + Indexes

2. Group Starts

3. Counters

4. Sort the resulting countes back

The Final Solution

Linked

`Transpose` groups `m` = 5

`Sorting` algorithm