Splitting a string at specific letters in Python, except when followed by another letter

Question

So I have the string

sequence = 'MYNSIYGSPFPKINPKVRYKTALERAGFDTKPRNPFSSQRNASTGSLQASVKSPPITRQRNVSAAPSVPVTMKSAYTASSKSAYSSVKGESDIYPPPVLENSERRSVTPPKNSNFTSSRPSDISRSISRPSERASQEDPFRFERDLDRQAEQYAASRHTCKSPANKEFQAADNFPFNFEQEDAGNTEREQDLSPIERSFMMLTQNDTASVVNSMNQTDNRGVLDQKLGKEQQKEESSIEYESEGQQEDENDIESLNFEPDPKLQMNLENEPLQDDFPEAKQEEKNTEPKIPEINVTRESNTPSLTMNALDSKIYPDDNFSGLESSKEQKSPGVSSSSTKVEDLSLDGLNEKRLSITSSENVETPYTATNLQVEQLIAQLDDVSLSRNAKLDMNGNCLNAVDRKASRFKKSSAYLSGYPSMDIPVTQQTSIVQNSNTNLSRQTILVDKGDVDEDAPSESTTNGGTPIFYKFKQSNVEYSNNEGMGSQETFRTKLPTIEALQLQHKRNITDLREEIDNSKSNDSHVLPNGGTTRYSSDADYKETEPIEFKYPPGEGPCRACGLEVTGKRMFSKKENELSGQWHRECFKCIECGIKFNKHVPCYILGDEPYCQKHYHEENHSICKVCSNFIEGECLENDKVERFHVDCLNCFLCKTAITNDYYIFNGEIPLCGNHDMEALLKEGIDNATSSNDKNNTLSKRRTRLINFN'

I want to split this string after every 'K' and 'R', except when either letter is followed by a 'P'. What might be the easiest way to do this?

To reiterate: split the string at 'K' and 'R', but not at 'KP' or 'RP'.

Please show us what your expected output is. And you don't need such a long sample input string in your question. — Tim Biegeleisen, Jul 30 '18 at 13:45

Tim Biegeleisen · Accepted Answer · 2018-07-30T13:54:06.707

4

Try splitting with a negative lookahead:

re.split(r'[KR](?!P)', sequence)
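As a small illustration of what that one-liner does, here is a sketch on a short made-up string (`sample` is hypothetical, not part of the question's data). Note that the matched K or R is consumed by the split:

```python
import re

# Hypothetical short string, used only for illustration.
sample = "AKBRPCKPDRE"

# [KR] matches the letter to split at; (?!P) rejects it when a P follows.
# The matched letter itself is removed from the result.
pieces = re.split(r"[KR](?!P)", sample)
print(pieces)  # ['A', 'BRPCKPD', 'E']
```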

This would be the correct answer if you want to split and consume/remove the K or R letter in the process. If you instead want to split right after each K or R (when P does not follow) while retaining all characters, then a simple re.split is not enough on Python versions before 3.7, because there re.split does not split on zero-width matches such as lookbehinds.
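For readers on Python 3.7 or later, where re.split does split on patterns that can match the empty string, a zero-width pattern may be enough on its own. This is a minimal sketch under that version assumption; `sample` is again a made-up string:

```python
import re

# Hypothetical short string, used only for illustration.
sample = "AKBRPCKPDRE"

# (?<=[KR]) requires a K or R just before the split point,
# (?!P) rejects the point when a P comes next. Nothing is consumed.
# Requires Python 3.7+, which added splitting on empty matches.
parts = re.split(r"(?<=[KR])(?!P)", sample)

# A string ending in K or R yields a trailing '', so drop empty pieces.
pieces = [p for p in parts if p]
print(pieces)  # ['AK', 'BRPCKPDR', 'E']
```

Because the pattern has zero width, joining the pieces gives back the original string.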

One workaround that works on any version is to first do a replace-all that inserts a marker symbol at every place where a split should happen, e.g. $, which does not appear anywhere in your current input. Then we can do a simple split on this marker character to get the result you want.

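import re

sequence = 'MYNSIYGSPFPK...'
# Insert '$' after each K/R unless a P (or the end of the string) follows.
# The lookahead leaves the next letter unconsumed, so adjacent K/R are each marked.
seq_new = re.sub(r'([KR])(?!P|$)', r'\1$', sequence)
result = seq_new.split('$')  # split at '$' and consume the '$'
print(result)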

['MYNSIYGSPFPK', 'INPK', 'VR', 'YK', 'TALER', 'AGFDTKPR', 'NPFSSQR', ...
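For comparison, the same rule can be written without regular expressions. This is only a sketch; the function name `split_after` and its parameters `cut` and `block` are hypothetical and introduced here for illustration:

```python
def split_after(seq, cut="KR", block="P"):
    """Split seq after any letter in cut, unless the next letter is in block."""
    pieces, start = [], 0
    for i, ch in enumerate(seq):
        nxt = seq[i + 1:i + 2]  # empty string at the end of seq
        if ch in cut and nxt and nxt not in block:
            pieces.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        pieces.append(seq[start:])  # remainder after the last split point
    return pieces

# First 40 letters of the sequence from the question.
print(split_after("MYNSIYGSPFPKINPKVRYKTALERAGFDTKPRNPFSSQR"))
# ['MYNSIYGSPFPK', 'INPK', 'VR', 'YK', 'TALER', 'AGFDTKPR', 'NPFSSQR']
```

A loop like this can serve as a cross-check for a regex-based result, for example on inputs with adjacent K or R letters.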


edited Jul 30 '18 at 13:54

answered Jul 30 '18 at 13:43


But that would remove the `K` or `R`, instead of splitting after it. – Tomalak Jul 30 '18 at 13:46
@Tomalak Yes, which is why I'm surprised so many people upvoted this. – Tim Biegeleisen Jul 30 '18 at 13:46
@Tomalak on the other hand it says _split at_ in the last sentence.... – Sebastian Proske Jul 30 '18 at 13:47
@SebastianProske `re.split` does not directly support lookbehinds, but there are a few workarounds we can do. – Tim Biegeleisen Jul 30 '18 at 13:55
@Tim Not sure why you ping me? If you want to have https://stackoverflow.com/questions/2713060/why-doesnt-pythons-re-split-split-on-zero-length-matches added to the duplicate list, you should rather ping Wiktor. – Sebastian Proske Jul 30 '18 at 14:12
@SebastianProske. You guys are both European, I might have gotten you mixed up. – Tim Biegeleisen Jul 30 '18 at 14:13
Works just fine for my purposes, thanks :) – Real Person Jul 30 '18 at 14:16
