What is the most efficient way to get the length of common prefix between two integers?

Question

What is the most efficient way to get the length of common prefix and extract out the uncommon suffixes.

int a =  866559999;
int b =  866560000;

Basically I am looking to extract out 60000 and 59999 in the above example possibly in O(1) time i.e. without traversing the number and taking mod by 10. Probably this can be done using bitwise operator.

Bitwise operators could do it if you wanted the length of the common prefix of the binary representation, or hexadecimal with some modifications, decimal seems difficult — harold, May 01 '22 at 06:23
Bitwise operations won't do it, unless you're trying to extract a "suffix" that is represented in binary (or hex or octal, which are both bases that are powers of two). If you want your "suffix" to be in decimal digits, then you'll need to traverse the number taking mod 10. — Peter, May 01 '22 at 06:26
@Evg - Not if "traversing" involves multiple operations (in this case, the number of operations will tend to be proportional to the number of decimal digits to be extracted, so the complexity will be O(digits_to_extract), not O(1)). — Peter, May 01 '22 at 06:33
@Peter But the number of digits of an `int` number is limited by a constant. — Evg, May 01 '22 at 06:38
If you change *"possibly in O(1) time ie without traversing the number and taking mod by 10"* to *"possibly without traversing the number and taking mod by 10"*, you might preserve your intent while getting rid of the (valid) technicality that the operation you want to avoid is O(1) since the number of digits in an `int` is bounded. — JaMiT, May 01 '22 at 07:06
Bob McCheese, with `866560000` and `866560`, is the common length 6? What about `866560000`, `-866560000` or `-123`, `-456`? — chux - Reinstate Monica, May 01 '22 at 07:07
This reminds me of my leetcode days. I think one algorithm to do this is called the Knuth-Morris-Pratt algorithm. — Mr. Fegur, May 01 '22 at 07:07
"without ... taking mod by 10" --> Hmm, they are base 10 forms of the integers. Somehow `10` will be involved. — chux - Reinstate Monica, May 01 '22 at 07:09
@chux-ReinstateMonica yes if you can form any trick with a `1010` and bitwise that will be always welcome. — ishandutta2007, May 01 '22 at 07:17
How do you define "most efficient"? It looks like you are interested only in time efficiency, in which case space efficiency is not a concern? A highly time-efficient method would be a lookup table, requiring only 2 exabytes of storage, assuming 32-bit integers. That's not too much to ask, is it? :) — JaMiT, May 01 '22 at 07:17
@BobMcCheese What about [this](https://stackoverflow.com/questions/72074448/what-is-the-most-efficient-way-to-get-the-length-of-common-prefix-between-two-in?noredirect=1#comment127350092_72074448)? — chux - Reinstate Monica, May 01 '22 at 07:26
when you want to do something for `N` digits the algorithm is necessarily `O(N)`. — 463035818_is_not_an_ai, May 01 '22 at 07:59
Do you have too many repeats of this function over N elements (of array a and array b) or is it just called once in a while? Because with vectorization you get 4x-16x speedup that should compensate for the traversal about 10 digits. — huseyin tugrul buyukisik, May 01 '22 at 08:05
@huseyintugrulbuyukisik a,b is 32 bit ie 10^9 . the difference between a and b is ~1000 in worst case ~100 in average case. So for a list with 10 digit sorted numbers when I am checking two consecutive number they are ~100 apart then obviously 7-8 out of the 10 leading digits will be same.(In example its not the case though , in example difference is 1 but only 4 digits are common) — ishandutta2007, May 01 '22 at 08:20
@huseyintugrulbuyukisik can you explain with code about the speedup tricks via vectorizaton that you are talking about. — ishandutta2007, May 01 '22 at 08:40
@Evg - Even for built-in types (`int`, `long`, etc) the range of values - therefore the number of digits - is implementation defined. But, yeah, for a specified implementation - and assuming that some "big integer" type isn't being used to represent integral values of some arbitrary range - there is an upper bound on number of digits and therefore on the steps needed (once an algorithm is selected). That isn't the definition of `O(1)` though. — Peter, May 01 '22 at 13:16
A primitive approach with division by powers of 10 can be done in O(log N) by a binary search over the highest unequal digit. E.g. divide both numbers by 1000 (or use the multiplication trick) and compare the results. — Sebastian, May 01 '22 at 14:26
@JaMiT: How long do you think a lookup from two exabytes takes, compared to calculating the result? — Eric Postpischil, May 01 '22 at 15:04
@EricPostpischil I couldn't say how it compares because I don't know how much time the calculation requires. For example, I cannot rule out someone designing a CPU with an instruction that does the calculation in one cycle, and there is no beating that. Maybe take the lookup table as a goal to measure the calculation against? I am envisioning the entire lookup table sitting in RAM (not even swapping to disk), so the access time should be independent of table size, aside from caching effects. Can someone devise a calculation that beats (uncached) main memory lookup? — JaMiT, May 01 '22 at 17:46
@JaMiT I would guess my answer implemented as SIMD would take way < 10ns to compute, but it needs a few constants loaded. — Sebastian, May 01 '22 at 18:01
@JaMiT: Even with all data in RAM, a load that is not in cache (as we must expect for a two exabyte table) [may take hundreds of cycles](https://stackoverflow.com/a/65697590/298225). It will not be faster than calculation for this function. — Eric Postpischil, May 01 '22 at 18:04
@EricPostpischil Well, since part of my intent was to push things ludicrously large for current technology (hence the smiley in my earlier comment), I'll just add the assumption that the L1 cache can hold the entire table. That knocks the time down to 4-6 cycles, if I've read your reference's reference correctly. That's not too much to ask, is it? :) *(If it is, we get back to the other part of my intent -- emphasizing that "O(1) time" is not really a good goal for this problem, hence the words "in O(1) time i.e." should be removed from the question.)* — JaMiT, May 01 '22 at 18:42

huseyin tugrul buyukisik · Answer 1 · 2022-05-01T14:15:30.370

You can still vectorize the integer - to - string conversion to have speedup.

Here is auto-vectorized integer-to-string conversion with some benchmarking:

// https://github.com/tugrul512bit/VectorizedKernel/blob/main/VectorizedKernel.h
#include "VectorizedKernel.h" 
#include <iostream>
#include <string>



int main()
{

    constexpr int simd =16;


    auto kernelVectorized = Vectorization::createKernel<simd>(
        [&](auto & factory,
            auto & idThread,
            unsigned int * bufferIn,
            char * bufferOut
            )
        {
            const int currentSimdWidth = factory.width;
            auto digitsReversed = factory.template generateArray<unsigned int,12>();
            auto digitAsChar = factory.template generate<char>();
            auto tmp = factory.template generate<unsigned int>();
            auto tmp2 = factory.template generate<unsigned int>();
            auto tmp3 = factory.template generate<unsigned int>();
            auto value = factory.template generate<unsigned int>();
            auto writeIndex = factory.template generate<int>();

            
            
            for(int j=0;j<N;j+=simd)
            {
                // null termination
                digitsReversed[0].broadcast(0);
                writeIndex.readFrom(idThread);
                value.readFrom(bufferIn+j,idThread);

                // 11 digits!!??
                for(int i=1;i<11;i++)
                {
                    // fast-modulo to re-use the division
                    // value % 10 --> tmp (digit)
                    value.div((unsigned int)10,tmp3);
                    tmp3.mul((unsigned int)10,tmp2);
                    value.sub(tmp2,tmp);

                    // value /= 10
                    value.readFrom(tmp3);

                    // write digit character
                    tmp.add((unsigned int)'0',digitsReversed[i]);
                }

                writeIndex.mul(12,writeIndex);


                for(int i=0;i<11;i++)
                {
                    digitsReversed[11-i-1].template castAndCopyTo<char>(digitAsChar);
                    digitAsChar.writeTo(bufferOut+(j*12),writeIndex);
                    writeIndex.add((int)1,writeIndex);
                }
            }
        },
        /* defining kernel parameter types */
        Vectorization::KernelArgs<unsigned int*,char*>{});

    std::vector<unsigned int> testInput(N);
    std::vector<std::string> testOutput(N);
    std::vector<char> testOutputChar(N*12);
    for(int i=0;i<N;i++)
    {
        testInput[i]=50000000+i;
    }

    size_t t;
    for(int j=0;j<20;j++)
    {
        {
            Vectorization::Bench bench(&t);
            for(int i=0;i<N;i++)
            {
                 testOutput[i]=std::to_string(testInput[i]);
            }
        }
        std::cout<<t<<" nanoseconds (non-vectorized)"<<std::endl;
    }




    for(int i=0;i<20;i++)
    {
        {
            Vectorization::Bench bench(&t);
            kernelVectorized.run(simd,testInput.data(),testOutputChar.data());
        }
        std::cout<<t<<" nanoseconds (vectorized !!!)"<<std::endl;
    }

    for(int i=0;i<200;i++)
    {
        std::cout<<std::string(testOutputChar.data()+(i*12))<<" =?= "<<testOutput[i]<<std::endl;
    }

    return 0;
}

For FX8150, conversion for 1 million integers is 50 milliseconds with std::to_string and 32 milliseconds with vectorized version. On Cascadelake CPU, it is 24 milliseconds and 7.1 milliseconds respectively. These include writing zero-padded string to target string array.

If you are going to work with zero-padded format, then you need to skip the first n zeroes.

When string characters are not written to target, it has higher speedup but then to fully compute your algorithm, you still need to stringify the second parameter which will take another 7.1 milliseconds per 1 million inputs (making 14.2 milliseconds total) then do the comparison which would take similar time ~28 milliseconds total for Cascadelake CPU.

Same kernel with two variable converted at a time, from two input arrays:

https://godbolt.org/z/vdPhGK194

25384216 nanoseconds (non-vectorized)
25418696 nanoseconds (non-vectorized)
25396651 nanoseconds (non-vectorized)
9843377 nanoseconds (vectorized !!!)
8936126 nanoseconds (vectorized !!!)
8776456 nanoseconds (vectorized !!!)
8585255 nanoseconds (vectorized !!!)
8558483 nanoseconds (vectorized !!!)

You are right, editing answer. (also compiler producing the thing you say, ```vpsrlq zmm3, zmm2, 32 vpmuludq zmm3, zmm3``` — huseyin tugrul buyukisik, May 01 '22 at 14:14

Sebastian · Answer 2 · 2022-05-01T15:09:11.350

We can test for

a == b => 10/10 common digits, else
a/10 == b/10 => 9/10 common digits, else
a/100 == b/100 => 8/10 common digit, else
a/1000 == b/1000 => 7/10 common digits, else
a/10000 == b/10000 => 6/10 common digits, else
a/100000 == b/100000 => 5/10 common digit, else
a/1000000 == b/1000000 => 4/10 common digits, else
a/10000000 == b/10000000 => 3/10 common digits, else
a/100000000 == b/100000000 => 2/10 common digit, else
a/1000000000 == b/1000000000 => 1/10 common digit, else 0/10 common digits.

Those divisions by a constant can be optimized with inverse multiplication constants instead, and be parallelized with SIMD instructions for multiplication, rightshift and comparison.

By extracting the prime factor 2 of the powers of 10, part of the division can be done with boolean ANDing, reducing the number of needed bits for the multiplication.

In the end it would be branchless and only take few cycles.

For extracting out the deviating suffixes one would multiply one of the two inputs of the topmost successful comparison (they are the same, if the comparison is successful) with the divisor and subtract it from both a and b. If none is successful one would just return a and b themselves.

When a=100 and b=1, this outputs 7/10 common digits. Did you mean to start the if from bottom? — huseyin tugrul buyukisik, May 01 '22 at 15:44
It counts the leading 0 digits, too, for a 10-digit number (uint32_t). `a=0'000'000'100, b=0'000'000'001`. You could also say the `10-7=3` rightmost digits contain the difference. — Sebastian, May 01 '22 at 16:15

score 0 · Answer 3 · answered May 01 '22 at 07:48

0

A not so efficient an approach, but a baseline to compare. Does not use %. as requested by OP.

int count_common(unsigned a, unsigned b) {
  // Find power of 10 for a, b
  const unsigned pow10_max = 1000u*1000*1000;
  unsigned apow10 = pow10_max;
  while (a < apow10 && apow10 > 0) {
    apow10 /= 10;
  }
  unsigned bpow10 = pow10_max;
  while (b < bpow10 && bpow10 > 0) {
    bpow10 /= 10;
  }
  int count = 0;
  while (a > 0 && b > 0) {
    unsigned aquot = a/apow10;
    unsigned bquot = b/bpow10;
    if (aquot != bquot) break;
    count++;
    a -= aquot*apow10;
    apow10 /= 10;
    b -= bquot*bpow10;
    bpow10 /= 10;
  }
  return count;
}

answered May 01 '22 at 07:48

chux - Reinstate Monica

143,097
13
135
256

1

This is tongue-in cheek, right? Yeah, you're avoiding using the % operator. But with the repeated division by `10` and subtraction, your code is doing something quite similar (given how modulo is defined in terms of integer division). – Peter May 01 '22 at 13:23
@Peter, besides, any decent compiler will turn a `/10` into a multiplication. – Elliott May 01 '22 at 14:02

What is the most efficient way to get the length of common prefix between two integers?

3 Answers3