2

For example, I am given the following string of 0's and 1's : 001011. This is my pattern, or string A. Then, I am given a longer string B of 0's and 1's (example: 010100010010010) and my task is to find the the most appropriate match of string A in B, and I say ,,the most appropriate" because it doesn't need to be necessarily string A, it could have an error of maximum 20%.

For example: for string A: 01001, a good match would be 11001. 80% percent of string B matches string A, except the first bit. For the same A string, 11101 would match only 60% of it (the bits on the first and third position in 11101 do not match the first and the third bit in A) which is not the desire solution.

If N is the number of bits of string A, it means that I perform a check on an N-length sequence of B at once (the evaluated bits in B must be on consecutive positions, so this excludes a substring from B). For example: Let it be A- 01011 and B-010100100111. First we evaluate the sequence 01010 (the first 5 bits of B starting with the very first position), then 10100 (the first 5 bits starting with the second bit in B). In this example, in 01010 only 4 bits match with A, that means 01010 is a 80% match. Concerning 10100, no bit match with A, thus it is a 0% match.

I could have a case where A is: 01001 and B is: 01101 ( the first 2 bits from B match the first 2 bits of A and the last 2 bits of B match the last 2 bits A). Therefore, it is a 80% match.

If A is longer than B, then A has no match in B.


I would like to know an algorithm, strategy to attack this problem. I hope I made this problem as clear as possible, if not, I am going to modify of provide you with further explanations. I assume this problem might actually have some applications in real world in regard to pattern-matching. I need a solution and I am looking forward to improving the explanations as much as possible.

Andrew Stef
  • 61
  • 1
  • 7
  • 2
    I would start with traversing the string char by char, and comparing them with my search string. Maybe do multiple overruns, where you first try to find an exact match, and after search for slight modifications of your search string. This is a brute force approach, but it gets you started. optimizations follow afterwards. Maybe also look at regex and the way regexes are implemented. That may help also. – jwsc Feb 07 '17 at 12:15
  • 1
    Take a look at [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) – Yuriy Ivaskevych Feb 07 '17 at 12:22
  • You are comparing strings of 1's and 0's. Does that mean that you are comparing `String`s of 1's and 0's? If so, are you looking for a generalized algorithm that can be applied to any two `String`s, or do you only care about 1's and 0's? If you only care about 1's and 0's, then @Gijs answer is a good starting point. Otherwise, I agree that you should start with a brute force implementation (fairly trivial to do), and look for specific ways to optimize it. – Sam Hazleton Feb 07 '17 at 14:37

1 Answers1

1

So your actually just comparing binary numbers. So convert the string to a number and compare them binary. So use XOR bit operation (https://en.wikipedia.org/wiki/Bitwise_operation), (in C ^)

11001 ^ 11101 = 00100

0 ^ 0 = 0 
1 ^ 0 = 1
0 ^ 1 = 1
1 ^ 1 = 0

(this is taken from How to compare two bit values in C?. maybe this a duplicate?).

Added code in python, probably not optimal but might be usefull

def bitcount(n):
count=0
while(n):
    count+= (n & 1)     
    n >>=1

return count

a="01011"
b="010100010010010"

number= int(a, 2)
new = []
result=[]
for i in range(0, len(b)-len(a)):
    new.append(int(b[i:i+len(a)],2))

    compare = number ^ new[i]

    if(bitcount(compare) < int(0.2*len(a))+1):
       print(b[i:i+len(a)])

This scales with (len(b)-len(a))*len(a), I think. O(a b) correct me if i am wrong?..

Community
  • 1
  • 1
  • To get the match %, you would still need to divide the number of digits in string A by the number of 1's in the result. – Sam Hazleton Feb 07 '17 at 14:26
  • This is a good starting point, it is mainly brute force, so concerning the improvement of the algorithm, how would you do it using dynamic programming in order to execute less operations? – Andrew Stef Feb 07 '17 at 15:02
  • I would look into sequence analysis, something like DNA sequence comparison would do something similar. – Gijs Den Hollander Feb 07 '17 at 16:00