I'm trying to find a fast algorithm that finds the longest common prefix of multiple bit arrays. In my application, those bit arrays can be infinitely long and of variable length. For example, given these bit arrays:

0b1011001
0b1001101
0b1001010
0b1010100

The longest common prefix would be 10. I'm currently ORing and NANDing the bit arrays to find their common 0s and 1s, then XORing the two results together:

OR
0b1011111

NAND
0b0111111

XOR
0b1100000
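
In Python it looks roughly like this (a simplified sketch with fixed 7-bit integers; my real arrays are variable-length and stored internally as integers):

from functools import reduce

arrays = [0b1011001, 0b1001101, 0b1001010, 0b1010100]
width = 7
mask = (1 << width) - 1

ored = reduce(lambda x, y: x | y, arrays)           # 0b1011111
nand = reduce(lambda x, y: x & y, arrays) ^ mask    # 0b0111111
agree = ored ^ nand                                 # 0b1100000, 1s where all arrays agree

# The common prefix length is the leading run of 1s in agree.
prefix_len = 0
for i in range(width - 1, -1, -1):
    if not (agree >> i) & 1:
        break
    prefix_len += 1
print(prefix_len)  # 2, i.e. the prefix 10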

Is there any faster solution?

Mox
  • ORing all the arrays seems to be quite an expensive solution. You can just scan the arrays left to right, stopping when you find the first mismatch. This would be O(kn), where n is the number of arrays and k is the length of the common prefix. – Haile Aug 02 '12 at 11:56
  • @Haile ORing and NANDing are not so expensive in my case, because the bit arrays are represented internally by integers. Scanning left to right seems naive at first glance... – Huy Dang Lê-Ngô Aug 06 '12 at 01:16
  • You wrote that "those bit arrays can be **infinitely long** and of variable length"! Also, it's not naive. For arbitrarily long arrays, it's a lot faster than ORing the whole array. – Haile Aug 06 '12 at 12:28

4 Answers


On your approach

It scales well (linearly) with the number of bit-arrays.

It does not scale well with the size of the bit-arrays; ideally it should scale with the length of the common prefix rather than with the full size of the bit-arrays.

At a low level

Bit operations on whole bytes/words of a bit-array should be much faster than walking along the bits one at a time. (Not sure how much low-level control Python gives you, though.)

First Suggestion

If this were a low-level language like C, I would solve this in a way similar to you, but with a couple of ideas from the other answers.

In my example, I will assume that the computer is a 64-bit machine.

I start by applying the (OR, NAND, XOR) step to just the first 64 bits of each bit-array (these are single operations on a 64-bit machine, and the complexity is just O(number of arrays)).

Then quickly find the position of the first mismatch in the result (with your OR-XOR-NAND encoding that is the first zero bit; equivalently, XOR the OR with the AND and look for the first set bit). Most CPUs have a fast find-first-set / count-leading-zeros instruction for this, available from C through compiler intrinsics. If there is a mismatch, return its position.

Otherwise, repeat on bits 64-127 of each bit-array.

(You will need to deal somehow with bit-arrays of different lengths, probably by finding the shortest bit-array of the bunch and using its length as the maximum.)

The benefit of this approach is that it is linear in the number of bit-arrays and linear in the length of the common prefix.
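
Roughly, in Python (a sketch only; it assumes all bit-arrays have the same length and are stored as ints with the first bit in the most significant position, and it uses the equivalent OR-XOR-AND mismatch mask):

def common_prefix_length(arrays, length):
    # Process 64-bit chunks, most significant (first) bits first.
    prefix = 0
    for hi in range(length, 0, -64):
        lo = max(hi - 64, 0)
        width = hi - lo
        chunk_mask = (1 << width) - 1
        ored = 0
        anded = chunk_mask
        for a in arrays:
            chunk = (a >> lo) & chunk_mask
            ored |= chunk
            anded &= chunk
        diff = ored ^ anded          # 1s mark positions where the arrays disagree
        if diff:
            # Highest set bit of diff = first mismatch inside this chunk.
            return prefix + (width - diff.bit_length())
        prefix += width
    return prefix

print(common_prefix_length([0b1011001, 0b1001101, 0b1001010, 0b1010100], 7))  # 2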

Second Suggestion

If there is a large number of bit-arrays, you can get a speed-up by using parallelism.

First you can run the OR at the same time as the NAND.

With more ingenuity, you could apply the following:

If I have 4 bit-arrays A,B,C,D

Instead of (((A OR B) OR C) OR D)

I could do (A OR B) OR (C OR D).

In both cases, the same number of ORs is done.

But the second approach can be parallelized effectively (and even on a single core the second form may be faster, since CPUs often have more than one ALU anyway).

Writing the logic out is a little trickier, since you can't use a single for-loop with a single temporary variable holding the result of the ORs.

You could imagine storing the sub-results in an array with length one half of the number of bit-arrays. Store the subresult of A OR B in array[0] and C OR D in array[1], then do array[0] OR array[1]. (and you could store that result in a new array of half the length, or overwrite values in your array to save space and memory allocations).

Partition the work into large enough chunks to keep the whole computer busy without introducing too much overhead.

With enough processors you could approach a complexity that is logarithmic in the number of bit-arrays instead of linear. But in practice, a 5-times speed-up on a 6-core machine would already be pretty good.
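
As a sketch of the reduction order only (real parallelism would need e.g. multiprocessing or threads, which I'm not showing; the function name is mine):

def tree_or(arrays):
    # Pairwise, tree-shaped OR: (A | B) | (C | D) instead of ((A | B) | C) | D.
    # The pairs within each round are independent and could go to separate workers.
    vals = list(arrays)               # assumes at least one array
    while len(vals) > 1:
        paired = [vals[i] | vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:             # odd one out is carried to the next round
            paired.append(vals[-1])
        vals = paired
    return vals[0]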

Xantix
  • This is quite helpful. I am actually doing OR and NAND at the same time. I'm also using integers to store the bit arrays, so I'm doing byte/word operations. The second suggestion is the kind of optimization I was looking for. – Huy Dang Lê-Ngô Aug 06 '12 at 01:13

You don't need to OR and NAND the entire arrays (that would be quite expensive, since they are arbitrarily long). You can just scan the arrays left to right, stopping when you find the first mismatch. This would be O(kn), where n is the number of arrays and k is the length of the common prefix.

My Python is terrible, so I'll give only a very simple example with 2 arrays of fixed equal length, just to make it clear:

a = [1,0,1,1,0,0,1]
b = [1,0,1,1,0,1,1]

for x in range(0, 7):
    if a[x] != b[x]:
        print(a[0:x])
        break

output:
[1, 0, 1, 1, 0]

Obviously you have to fix that; I think it'll be easy if you understand the logic behind the code:

  • Iterate for x over the x-th bit of all the arrays until you find a mismatch (i.e. the arrays do not all have the same bit value), or until the end of the shortest array
  • Output the first x bits of array1 (see the sketch just below).
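
A rough generalization to any number of arrays of possibly different lengths could look like this (function name is mine):

def common_prefix(arrays):
    shortest = min(len(a) for a in arrays)
    for x in range(shortest):
        bit = arrays[0][x]
        if any(a[x] != bit for a in arrays[1:]):
            return arrays[0][:x]
    return arrays[0][:shortest]

print(common_prefix([[1,0,1,1,0,0,1],
                     [1,0,0,1,1,0,1],
                     [1,0,0,1,0,1,0],
                     [1,0,1,0,1,0,0]]))  # [1, 0]
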
Haile

In some cases the optimal solution is to use a prefix tree (trie). It has complexity O(n), where n is the sum of the shared prefixes of your binary strings, but with a large constant factor.
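
For example (a rough Python sketch, not an optimized trie; names are mine): insert every string into a binary trie, then follow the unique path from the root for as long as all strings pass through it.

def trie_common_prefix(bitstrings):
    root = {'count': 0, '0': None, '1': None}
    for s in bitstrings:
        node = root
        for ch in s:
            if node[ch] is None:
                node[ch] = {'count': 0, '0': None, '1': None}
            node = node[ch]
            node['count'] += 1
    # Walk down while a single child is shared by every string.
    prefix, node = [], root
    while True:
        shared = [ch for ch in '01' if node[ch] and node[ch]['count'] == len(bitstrings)]
        if len(shared) != 1:
            break
        prefix.append(shared[0])
        node = node[shared[0]]
    return ''.join(prefix)

print(trie_common_prefix(['1011001', '1001101', '1001010', '1010100']))  # '10'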

Odomontois
  • This works for any number of arrays, without exploding due to the increase in the number of pairs, so I'll upvote it. (Although there's not really a need to store all of the information in a different structure.) – Xantix Aug 04 '12 at 06:58
  • Just realized that the OP didn't need the longest common prefix of 2 of the bit-arrays, but the longest prefix of all of them. – Xantix Aug 04 '12 at 07:14

Assuming you have input strings s1, s2, s3, ...

  1. Let s = commonSubString(s1,s2)
  2. For i=3..n
    1. s = commonSubString(s, si)
  3. Return s

In the worst case s1 and s2 can be identical (or share a very long prefix), in which case you can use a heuristic: e.g. limit the initial length of s to k=100 first; if the final s still has length k=100, repeat the whole process starting at position k+1 in every string.
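
A minimal Python sketch of that loop (commonSubString here is just the pairwise common-prefix step; names are mine):

def commonSubString(a, b):
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return a[:i]
    return a[:n]

def common_prefix_all(strings):
    s = strings[0]
    for t in strings[1:]:
        s = commonSubString(s, t)
        if not s:
            break      # no common prefix left, stop early
    return s

print(common_prefix_all(['1011001', '1001101', '1001010', '1010100']))  # '10'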

ElKamina