Scaling regex on big strings in Python

Question

I'm trying to take the approach from "Regex substring one mismatch in any location of string" and scale it up to a big-data situation where I can:

Match all instances of a long substring such as SSQPSPSQSSQPSS (allowing at most one mismatch within the substring) in a much larger string such as SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS.

In reality, my substrings and the strings I match them against are hundreds and sometimes even thousands of letters long, and I want to allow for mismatches.

How can I scale the regex notation from "Regex substring one mismatch in any location of string" to solve my big-data problems? Is there an efficient way to go about this?

score 0 · Accepted Answer · answered Jul 12 '15 at 05:07

You can try this:

>>> import re
>>> s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
>>> re.findall(r'(?=(SSQPSPSQSSQPSS|[A-Z]SQPSPSQSSQPSS|S[A-Z]QPSPSQSSQPSS|SS[A-Z]PSPSQSSQPSS))', s)
['SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS']

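Likewise, add one alternative for each remaining position, replacing the character at that position with [A-Z]. The lookahead (?=...) lets overlapping matches be reported.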
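
Writing every alternative by hand is impractical for long substrings, so the pattern can be generated instead. Here is a minimal sketch of that idea; the helper name one_mismatch_pattern and its wildcard parameter are made up for illustration and are not part of the re module. It assumes the text contains only uppercase letters, as in the example above.

```python
import re

def one_mismatch_pattern(sub, wildcard="[A-Z]"):
    """Compile a lookahead regex matching `sub` with at most one substituted character.

    `wildcard` is an assumption: it must match any character allowed in the text.
    """
    # Exact match first, then one alternative per position with that
    # position replaced by the wildcard class.
    alternatives = [re.escape(sub)]
    for i in range(len(sub)):
        alternatives.append(re.escape(sub[:i]) + wildcard + re.escape(sub[i + 1:]))
    # Zero-width lookahead with a capture group, so overlapping matches are found.
    return re.compile("(?=(" + "|".join(alternatives) + "))")

s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
pattern = one_mismatch_pattern("SSQPSPSQSSQPSS")

# Start offsets and matched text of every window with at most one mismatch.
starts = [m.start() for m in pattern.finditer(s)]
matches = [m.group(1) for m in pattern.finditer(s)]
```

Note that the generated pattern has len(sub) + 1 alternatives, each about len(sub) characters long, so its size grows quadratically with the substring length. For substrings of a few thousand letters that means a pattern of several million characters, which may be slow to compile and match; comparing each window of the text against the substring directly and counting mismatches is likely a better fit at that scale.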
