2

I have the following problem: Given 15 bit strings of 1024 bit each, what is the best way to find patterns in different strings on the same position? How the pattern looks like doesn't matter to me (length > 1 of course), I just want to find parts (as long as possible) where at least two of the strings match. An example:

  • 10010...
  • 00011...
  • 00101...

Here i would like to get 001 from the first two strings (position 2 to 4, frequency 2 would also be great...) and 00 from the second and third string (position 1 to 2).

I hope, the problem is clear now... Anybody an idea? Thanks!

python15
  • 23
  • 4
  • 2
    Hi and welcome to Stack Overflow, please take a time to go through the [welcome tour](https://stackoverflow.com/tour) to know your way around here (and also to earn your first badge), read how to create a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and also check [How to Ask Good Questions](https://stackoverflow.com/help/how-to-ask) so you increase your chances to get feedback and useful answers. – DarkCygnus Jul 06 '17 at 18:22
  • What if you treat them as strings and try to find the common longest substring? There are many algorithms for that out there. If the length is not too long you could also do it the with brute force – FortyTwo Jul 06 '17 at 18:51
  • But I want to find all common substrings that are on the same position in both strings. Didn't get how your approach fits to this problem... – python15 Jul 06 '17 at 18:58
  • A little more clarification please. Let's throw two more string into your example: 10011, 10000 . Then for the pattern 001 from position 2, you want it to dump out {10010, 00011, 10011} but it to **not** dump out smaller subsets such as {10010, 00011} or {00011, 10011}? Or do you want the latter too? Do you also want it to output { 00011, 10000 } which match with 00 at position 2, or is this not of interest because the pattern length is smaller than the 001 pattern at position 2? Or is it not of interest because there are only two matching strings here whereas the other has 3? – TheGreatContini Jul 06 '17 at 22:09
  • What do you mean by "best"? Fastest? Do you need just a programming-language neutral algorithm, or you have a specific language in mind? I can think of several solutions for this problem, but you need to narrow down the little bit your question. Does the input have special properties, or are they completely random? – geza Jul 06 '17 at 22:52

1 Answers1

1

I would try to solve your problem in a two-step approach. Since you only have 15 bitstrings, it should be sufficient to compare all pairs of bitstrings (105 comparisons á 16 QWORDS should be doable)

For now, I'll also consider patterns of length 1 valid, we'll see how to get rid of this later. Since you don't mention a programming language, I'll try to keep it general and use C syntax for bit operations. Let a and b be two bitstrings.

  1. Identify the possible patterns: The ith bit is part of a pattern if and only if a[i] == b[i]. We can mark these bits efficiently by computing the bitwise xor (which corresponds to !=) and negating: patterns = ~(a ^ b)
  2. Identify longest pattern: After the transformation above has been computed, common patterns between a and b are consecutive 1s in pattern. Finding such long patterns can e.g. be solved by repeated shifting and AND operations, see this question for details. If you expect really long sequences, I would try to use some dedicated instructions to find non-pattern bits, I'll extend the answer if necessary.

If this is really high-throughput code, you could try doing multiple comparisons simultaneously, i.e. using SIMD instructions, but that's a topic for another question.

Tobias Ribizel
  • 5,331
  • 1
  • 18
  • 33