From FASTA file, extract only entries with specified taxonomy

Question

I would like to extract all the entries in a FASTA file that are from human (Homo sapiens) and write those entries to a new, smaller FASTA file. I'm trying to do this in R, but I'm not sure how.

Two entries from the fasta file are below:

>sp|Q4R572|1433B_MACFA 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=2 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

>sp|Q9CQV8|1433B_MOUSE 14-3-3 protein beta/alpha OS=Mus musculus GN=Ywhab PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLILNATQAESKVFY
LKMKGDYFRYLSEVASGENKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Oka · Answer 1 · 2019-03-27T22:16:04.563

If you want to do it in R, the Biostrings package provides readAAStringSet (and, in older releases, readFASTA), and seqinr provides read.fasta; any of these will read the file into R. You can then subset the entries however you like and write the result out (both packages have output functions as well).

You can find information about these functions in the documentation of the two packages.
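As a rough illustration of the read, subset, and write approach with Biostrings, here is a minimal sketch. The helper name filter_fasta_by_organism and the file names are hypothetical, and it assumes UniProt-style headers in which the organism appears as "OS=Homo sapiens", as in the entries shown in the question.

```r
library(Biostrings)

# Hypothetical helper: keep only the records whose header contains "OS=<organism>"
filter_fasta_by_organism <- function(in_path, out_path, organism = "Homo sapiens") {
  # names(seqs) holds each header line without the leading ">"
  seqs <- readAAStringSet(in_path)
  # Literal (non-regex) match on the OS= field of the header
  keep <- grepl(paste0("OS=", organism), names(seqs), fixed = TRUE)
  # Write the retained records to a new FASTA file
  writeXStringSet(seqs[keep], out_path)
  invisible(sum(keep))  # number of records kept
}

# Example call (file names are placeholders):
# filter_fasta_by_organism("all_entries.fasta", "human_only.fasta")
```

Note that this is a plain substring match, so it would also keep any organism whose name merely starts with the given string; tighten the pattern if that matters for your data.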

As FASTA is ultimately a plain-text format, you can also do it with base R functions, but this is not recommended.
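For completeness, a base R version might look something like the sketch below. It is not taken from the linked material; the function name filter_fasta_base is hypothetical, and it assumes every record starts with a ">" header line and that the organism is given as "OS=..." in that header. It does no validation of the sequences, which is one reason the package functions are the safer choice.

```r
# Hypothetical helper using only base R: filter FASTA records by the OS= field
filter_fasta_base <- function(in_path, out_path, organism = "Homo sapiens") {
  lines <- readLines(in_path)
  is_header <- startsWith(lines, ">")
  # Assign each line to a record: the counter increases at every header line
  record_id <- cumsum(is_header)
  # Indices of the records whose header mentions the organism (literal match)
  wanted <- which(grepl(paste0("OS=", organism), lines[is_header], fixed = TRUE))
  # Keep all lines of the wanted records, dropping blank separator lines
  keep <- record_id %in% wanted & nzchar(lines)
  writeLines(lines[keep], out_path)
  invisible(length(wanted))  # number of records kept
}

# Example call (file names are placeholders):
# filter_fasta_base("all_entries.fasta", "human_only.fasta")
```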
