I'm trying to take the approach from "Regex substring one mismatch in any location of string" and scale it up to a big-data situation where I can:

Match all occurrences of a long substring such as SSQPSPSQSSQPSS, allowing at most one mismatch within that substring, in a much larger string such as SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS.

In reality, my substrings and the strings I match them against are hundreds and sometimes even thousands of letters long, and I want to allow for a possible mismatch.

How can I scale the regex notation from "Regex substring one mismatch in any location of string" to solve my big-data problem? Is there an efficient way to go about this?

warship

1 Answer

You may try this:

>>> import re
>>> s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
>>> re.findall(r'(?=(SSQPSPSQSSQPSS|[A-Z]SQPSPSQSSQPSS|S[A-Z]QPSPSQSSQPSS|SS[A-Z]PSPSQSSQPSS))', s)
['SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS']

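Likewise, add one more alternative for each remaining position, replacing the character at that position with [A-Z]. The lookahead `(?=(...))` is what lets overlapping matches be reported.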
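For substrings hundreds of letters long, typing every alternative by hand is not practical, so the alternation can be generated instead. The following is only a sketch under some assumptions: the input consists of uppercase letters (so `[A-Z]` is a suitable wildcard), a "mismatch" means a substitution (no insertions or deletions), and the helper names `one_mismatch_pattern`, `find_one_mismatch`, and `scan_one_mismatch` are made up for illustration. Because the generated pattern has one alternative per position, its size grows roughly quadratically with the substring length, so a plain sliding-window comparison (a different technique, not a regex) is included as a possible alternative for very long substrings.

```python
import re


def one_mismatch_pattern(sub, wildcard="[A-Z]"):
    """Build a lookahead regex for `sub` with at most one substituted character.

    Assumes uppercase input; `wildcard` is what stands in for the mismatch.
    """
    alternatives = [re.escape(sub)]  # the exact match comes first
    for i in range(len(sub)):
        # Replace the character at position i with the wildcard.
        alternatives.append(re.escape(sub[:i]) + wildcard + re.escape(sub[i + 1:]))
    # The lookahead keeps matches zero-width so overlapping hits are found.
    return re.compile("(?=(" + "|".join(alternatives) + "))")


def find_one_mismatch(sub, text):
    """Return (start, matched_text) pairs using the generated regex."""
    pattern = one_mismatch_pattern(sub)
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]


def scan_one_mismatch(sub, text, max_mismatches=1):
    """Regex-free alternative: compare every window and stop early."""
    n = len(sub)
    hits = []
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        mismatches = 0
        for a, b in zip(sub, window):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this window
        else:
            # The inner loop finished without exceeding the limit.
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    s = ("SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSS"
         "SSQPSPSQSSQPSSSSQSPSSQSSQPSS")
    for start, match in find_one_mismatch("SSQPSPSQSSQPSS", s):
        print(start, match)
```

On uppercase-only input the two functions should report the same hits; which one is faster for a given substring length would need to be measured on real data.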

Avinash Raj