5

How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example where only one possible mismatch is incorporated:

>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']

Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.

How can I extend this code (or refactor this code) to explore the possibility of two mismatches?

Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ or SSQP) of the exact position the substring matched the string.

nhahtdh
  • 55,989
  • 15
  • 126
  • 162
warship
  • 2,924
  • 6
  • 39
  • 65

2 Answers2

6

You don't have to use re here you can use itertools module instead and save a lot of memory.

You can first extract all sub-strings with length 4 then compare them with your substring and just select those that have less that 2 difference with your substring :

from itertools import izip,islice,tee

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            yield ''.join(next(new))

Demo:

s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'

substring='SSQP'
print list(sub_findre(s,substring,2))

['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']

If you want to return the indices you need to put the indices in izip which you can use itertools.repeat() to repeat the index with the length of substring :

from itertools import izip,islice,tee,repeat

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            next(new)
            yield next(new)[0]

Demo:

print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • 1
    Indeed, regular expressions are just the wrong tool to use altogether. For 2 mistakes out of 20, there would be 190 alternates in the pattern. – 200_success Jul 12 '15 at 07:47
  • Can you return index numbers, similar to `match.start(0)` technique of 200_success? – warship Jul 12 '15 at 07:59
  • @Kasra Accepted! Do you mind if in the next few days I upload an algorithm I'm currently developing to solve this problem as well? I think I may have an alternate solution using a different technique than the very interesting technique you use here. – warship Jul 12 '15 at 08:12
  • @warship mm Ok, i'll try ;) but still i think may answer can be optimized,using generators is very efficient way in terms of memory use, although its very faster than regex, I think using a data structure like dictionary can be better since it use hash-table and you can get more speed from. – Mazdak Jul 12 '15 at 08:21
  • @Kasra Cool! If I figure out something interesting as well or an alternative approach, I will post my solution as an answer and you can evaluate it and let me know what you think :-) – warship Jul 12 '15 at 08:23
  • @warship Sure! just tag me!;) – Mazdak Jul 12 '15 at 08:24
3

The combinatorial explosion is not that bad for two mismatches out of four.

First, observe that you can omit SSQP itself, since it's covered by all of the more lenient cases.

re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)

So, the number of cases is

               4!
C(4, 1) = ––––––––––––– = 4
          1! * (4 - 1)!

For up to two mismatches, the number of cases is

               4!
C(4, 2) = ––––––––––––– = 6
          2! * (4 - 2)!

Namely,

re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)

(To simplify the illustration, I've taken the liberty of writing . instead of [A-Z].)


To get the positions of the matches instead of the text of the matches:

[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)]
200_success
  • 7,286
  • 1
  • 43
  • 74
  • I see what you're doing, but I have substrings as large as 10 sometimes 20 letters, sometimes more, it's very hard for me to scale using the `re` module. Perhaps there is something else I can use besides for regular expressions? I like your wording of "combinatorial explosion" though, I will put that terminology in my arsenal :-) – warship Jul 12 '15 at 07:20
  • 1
    Then why did you ask the question for C(4,2) instead of what you actually wanted? How many mistakes out of 20 are you talking about? – 200_success Jul 12 '15 at 07:27
  • 0, 1, or 2 mistakes in a substring of length 20 are possible. Writing such a thing out combinatorially is a royal pain. In the OP, I simply wanted to provide a MWE. – warship Jul 12 '15 at 07:30