I'd like to extract a subset of protein sequences from a .fasta file (swissprot_canonical-isoforms.fasta
) in a new file (selected_proteins.fasta
) if protein IDs are listed in a txt file (Interested proteins.txt
).
Below shows part of protein sequences in the swissprot_canonical-isoforms.fasta
. Protein IDs are shown between two "|" in the lines that start with ">". For example, "P04637" is a protein ID.
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD >sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQDQTSFQKENC >sp|P04637-3|P53_HUMAN Isoform 3 of Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQMLLDLRWCYFLINSS
Here are some protein IDs listed in the Interested proteins.txt
Q6ZWH5
Q8NG66
P51955
P51957
P04629
The final output should look like this (only listed the sequence for Q6ZWH5 as an example) :
>sp|Q6ZWH5|NEK10_HUMAN Serine/threonine-protein kinase Nek10 OS=Homo sapiens GN=NEK10 PE=2 SV=3
MPDQDKKVKTTEKSTDKQQEITIRDYSDLKRLRCLLNVQSSKQQLPAINFDSAQNSMTKS
EPAIRAGGHRARGQWHESTEAVELENFSINYKNERNFSKHPQRKLFQEIFTALVKNRLIS
REWVNRAPSIHFLRVLICLRLLMRDPCYQEILHSLGGIENLAQYMEIVANEYLGYGEEQH
TVDKLVNMTYIFQKLAAVKDQREWVTTSGAHKTLVNLLGARDTNVLLGSLLALASLAESQ
ECREKISELNIVENLLMILHEYDLLSKRLTAELLRLLCAEPQVKEQVKLYEGIPVLLSLL
HSDHLKLLWSIVWILVQVCEDPETSVEIRIWGGIKQLLHILQGDRNFVSDHSSIGSLSSA
NAAGRIQQLHLSEDLSPREIQENTFSLQAACCAALTELVLNDTNAHQVVQENGVYTIAKL
ILPNKQKNAAKSNLLQCYAFRALRFLFSMERNRPLFKRLFPTDLFEIFIDIGHYVRDISA
YEELVSKLNLLVEDELKQIAENIESINQNKAPLKYIGNYAILDHLGSGAFGCVYKVRKHS
GQNLLAMKEVNLHNPAFGKDKKDRDSSVRNIVSELTIIKEQLYHPNIVRYYKTFLENDRL
YIVMELIEGAPLGEHFSSLKEKHHHFTEERLWKIFIQLCLALRYLHKEKRIVHRDLTPNN
IMLGDKDKVTVTDFGLAKQKQENSKLTSVVGTILYSCPEVLKSEPYGEKADVWAVGCILY
QMATLSPPFYSTNMLSLATKIVEAVYEPVPEGIYSEKVTDTISRCLTPDAEARPDIVEVS
SMISDVMMKYLDNLSTSQLSLEKKLERERRRTQRYFMEANRNTVTCHHELAVLSHETFEK
ASLSSSSSGAASLKSELSESADLPPEGFQASYGKDEDRACDEILSDDNFNLENAEKDTYS
EVDDELDISDNSSSSSSSPLKESTFNILKRSFSASGGERQSQTRDFTGGTGSRPRPALLP
LDLLLKVPPHMLRAHIKEIEAELVTGWQSHSLPAVILRNLKDHGPQMGTFLWQASAGIAV
SQRKVRQISDPIQQILIQLHKIIYITQLPPALHHNLKRRVIERFKKSLFSQQSNPCNLKS
EIKKLSQGSPEPIEPNFFTADYHLLHRSSGGNSLSPNDPTGLPTSIELEEGITYEQMQTV
IEEVLEESGYYNFTSNRYHSYPWGTKNHPTKR
Is there any way to do this with python? Any help would be highly appreciated.