How can one read FASTA files directly into a data frame in R using only base code? These files store biological sequence information (e.g. DNA or protein) and have 2*n lines for n individual bio-molecules (id1 through idn), so they look like this:

>id1 #(always starts with a `>`) 
seq1
>id2
seq2
...
>idn
seqn

If one wants to stay in base R (instead of using dedicated packages like Biostrings and seqinr, which introduce their own classes for manipulating bio-sequences), how can one use e.g. read.table to get a simple data frame with an id and a seq column?

user3375672
  • @Andrie, yes in this they use Biostrings. I asked something different, namely if it can be done in base R, ie NOT using Biostrings and the like. – user3375672 Nov 10 '14 at 12:43
  • What's your problem with using specialised code that is designed to do the job you want to do? It's probably easier to use them to read the file and then convert to a simple data frame format (speaking as someone who wrote code to read these things about 8 years ago). – Spacedman Nov 10 '14 at 13:53
  • @Spacedman, No problem - but I like to think as much as possible in base code + as-few-as-possible specialized packages. JTT's solution below is a good example. – user3375672 Nov 11 '14 at 08:56

1 Answer

It certainly is possible in base R. Consider the following example and function:

# Demo data
library(CHNOSZ)
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")

# Function
ReadFasta <- function(file) {
   # Read the file line by line
   fasta <- readLines(file)
   # Identify header lines (only lines that start with ">")
   ind <- grep("^>", fasta)
   # Identify the first and last sequence line of each record
   s <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], length(fasta)))
   # Paste the sequence lines of each record into one string
   seqs <- rep(NA_character_, length(ind))
   for (i in seq_along(ind)) {
      seqs[i] <- paste(fasta[s$from[i]:s$to[i]], collapse = "")
   }
   # Create a data frame; strip only the leading ">" from each header
   DF <- data.frame(name = sub("^>", "", fasta[ind]), sequence = seqs,
                    stringsAsFactors = FALSE)
   # Return the data frame as a result object from the function
   return(DF)
}

# Usage example
seqs<-ReadFasta(file)

However, be warned: the function does not currently handle, e.g., special characters, which are rather commonplace in sequence files (in contexts such as 5' or #5 rRNA).
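
For anyone who wants to try the idea without installing CHNOSZ, here is a minimal sketch of a loop-free variant that runs on a throwaway file. The names read_fasta_df and demo_path are made up for this illustration, and the demo records are invented test data rather than real sequences. It assumes that header lines are exactly those starting with ">" and that blank lines can be ignored. It returns the id and seq columns asked for in the question.

```r
# Hypothetical helper (name chosen for this sketch only).
# Assumes header lines start with ">" and blank lines carry no information.
read_fasta_df <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]          # drop blank lines
  is_header <- startsWith(lines, ">")
  if (length(lines) == 0L || !is_header[1L]) {
    stop("input does not start with a FASTA header line")
  }
  n <- sum(is_header)
  record <- cumsum(is_header)                    # record number of every line
  # Group sequence lines by record; explicit factor levels keep records
  # that have a header but no sequence lines (they become "").
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(n)))
  seqs <- vapply(groups, paste, character(1), collapse = "",
                 USE.NAMES = FALSE)
  data.frame(id = sub("^>", "", lines[is_header]),
             seq = seqs,
             stringsAsFactors = FALSE)
}

# Invented demo data: multi-line sequence, plus ', # and > inside headers
demo_path <- tempfile(fileext = ".fasta")
writeLines(c(">id1 5' region #1",
             "ACGT",
             "TTGA",
             ">id2 a>b",
             "MKV"), demo_path)

df <- read_fasta_df(demo_path)
print(df)
```

Because readLines() reads each line as plain text, quote and comment characters in the headers pass through unchanged. Grouping with cumsum() and split() also avoids building explicit from/to indices for every record.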