i am new in the regex expression function with python. I have a file where i need to filter the aminoacid-sequence. Here is a quick look into the file:
>nxp:NX_A0A0A6YYD4-1 \PName=T cell receptor beta variable 13 isoform Iso 1 \GName=TRBV13 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=124 \SV=5 \EV=31 \PE=3 \ModResPsi=(52|MOD:00798|half cystine)(120|MOD:00798|half cystine) \ModRes=(106||N-linked (GlcNAc...) asparagine) \VariantSimple=(18|H)(27|V) \Processed=(1|31|PEFF:0001021|signal peptide)(32|124|PEFF:0001020|mature protein) MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL
>nxp:NX_A0A1B0GV90-1 \PName=Cortexin domain containing 2 isoform Iso 1 \GName=CTXND2 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=55 \SV=1 \EV=11 \PE=3 \VariantSimple=(13|N)(22|F)(29|T)(34|Q)(45|T) \Processed=(1|55|PEFF:0001020|mature protein) MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS
I could filter out the endpoint of the needed point and the starting point. As you can see in my code, the beginning ist after the coordinates after the \VarableSimple and the end should be the next ">" character. Now i cannot find the way to filter out the MLSPDLPD..... sequence. Could someone give me an idea?
with open('PATH/XYZ', 'r') as f:
data = f.read()
import regex
h = regex.compile("(.*)\n").match(data)
header = h.group(1)
start = regex.match(".+\\\VariantSimple=(\([^)]+\))*\s{0,1}", data)
start.captures(1)
end = regex.compile("(.)*\>").match(data)
end.captures(0)