
So I have the string

sequence = 'MYNSIYGSPFPKINPKVRYKTALERAGFDTKPRNPFSSQRNASTGSLQASVKSPPITRQRNVSAAPSVPVTMKSAYTASSKSAYSSVKGESDIYPPPVLENSERRSVTPPKNSNFTSSRPSDISRSISRPSERASQEDPFRFERDLDRQAEQYAASRHTCKSPANKEFQAADNFPFNFEQEDAGNTEREQDLSPIERSFMMLTQNDTASVVNSMNQTDNRGVLDQKLGKEQQKEESSIEYESEGQQEDENDIESLNFEPDPKLQMNLENEPLQDDFPEAKQEEKNTEPKIPEINVTRESNTPSLTMNALDSKIYPDDNFSGLESSKEQKSPGVSSSSTKVEDLSLDGLNEKRLSITSSENVETPYTATNLQVEQLIAQLDDVSLSRNAKLDMNGNCLNAVDRKASRFKKSSAYLSGYPSMDIPVTQQTSIVQNSNTNLSRQTILVDKGDVDEDAPSESTTNGGTPIFYKFKQSNVEYSNNEGMGSQETFRTKLPTIEALQLQHKRNITDLREEIDNSKSNDSHVLPNGGTTRYSSDADYKETEPIEFKYPPGEGPCRACGLEVTGKRMFSKKENELSGQWHRECFKCIECGIKFNKHVPCYILGDEPYCQKHYHEENHSICKVCSNFIEGECLENDKVERFHVDCLNCFLCKTAITNDYYIFNGEIPLCGNHDMEALLKEGIDNATSSNDKNNTLSKRRTRLINFN'

I want to split this string after every 'K' and 'R', except when either letter is followed by a 'P'. What might be the easiest way to do this?

To reiterate: split the string at 'K' and 'R', but not at 'KP' or 'RP'.

Real Person

1 Answer


Try splitting with a negative lookahead:

re.split(r'[KR](?!P)', sequence)
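To see what "consuming" means here, this is a small sketch run on just the first part of your sequence (the name `prefix` is only for illustration). Note that the K and R at each split point disappear from the output, while the K and R of 'KPR' stay wherever a P follows:

```python
import re

# First 34 residues of the sequence from the question
prefix = 'MYNSIYGSPFPKINPKVRYKTALERAGFDTKPRN'

# Split at K or R unless the next character is P; the matched K/R is dropped
pieces = re.split(r'[KR](?!P)', prefix)
print(pieces)
# ['MYNSIYGSPFP', 'INP', 'V', 'Y', 'TALE', 'AGFDTKP', 'N']
```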

This is the correct answer if you want to split and consume (remove) the K or R in the process. If you instead want to split wherever a K or R precedes (and a P does not follow) while retaining all characters, the split has to happen at a zero-width position, using a lookbehind. Before Python 3.7, re.split did not split on patterns that can match an empty string, so a simple re.split is not enough there.

One workaround is to first insert a marker symbol at every place where a split should happen, e.g. $, which does not appear anywhere in your current input. Then a simple split on this marker character gives the result you want.

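import re

sequence = 'MYNSIYGSPFPK...'
# Insert '$' after every K or R that is not followed by P (and is not the last character).
# The lookahead does not consume the next character, so runs like 'KK' get both markers.
seq_new = re.sub(r'([KR])(?!P|$)', r'\1$', sequence)
result = re.split(r'\$', seq_new)         # split at '$' and consume the '$'
print(result)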

['MYNSIYGSPFPK', 'INPK', 'VR', 'YK', 'TALER', 'AGFDTKPR', 'NPFSSQR', ...
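On Python 3.7 or later, re.split does split on patterns that can match an empty string, so the marker step can be skipped there. The following is a minimal sketch under that assumption, again on a short prefix of your sequence; the names `prefix` and `cut_site` are just for illustration, and the `|$` alternative is my addition to avoid a trailing empty string when the input ends in K or R:

```python
import re

# First 40 residues of the sequence from the question
prefix = 'MYNSIYGSPFPKINPKVRYKTALERAGFDTKPRNPFSSQR'

# Zero-width pattern: position preceded by K or R, not followed by P or end of string.
# Requires Python 3.7+, where re.split accepts empty matches.
cut_site = r'(?<=[KR])(?!P|$)'

parts = re.split(cut_site, prefix)
print(parts)
# ['MYNSIYGSPFPK', 'INPK', 'VR', 'YK', 'TALER', 'AGFDTKPR', 'NPFSSQR']
```

Because nothing is consumed, joining the pieces back together gives the original string.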


Tim Biegeleisen