I'm just starting out with R and trying to get a grasp of some of the built-in functions. I'm trying to organize a basic FASTA text file that looks like this:

>ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC

Into a table that'd look something like this:

ID   Sequence
ID1  AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
ID2  TCCAATTAAGTCCCTATCCAGGCGCTCCG
ID3  GAACCGGAGAACGCTTCAGACCAGCCCGGAC

Or at least something organized in a similar manner. Unfortunately, whenever I try to use read.table, I'm forced to set fill = TRUE to avoid the following error:

> read.table("ReadingText.txt", header=F, fill=F, sep=">")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 2 elements

Setting fill = TRUE doesn't solve the problem as it just introduces unwanted blank fields. I feel like my problem is that R wants to treat each new line from the input as a new row in the output, whereas I'm expecting it to start a new row only at each ">" and move to the next column of the same row at each new line of the input.

So, how would you get this to work? Is read.table just the wrong function to be trying to do this with or is there something else? Also, I'd really like to accomplish this without using any packages! I want to get a good grasp of the built-in functions in R.

Thanks for taking the time to read this and apologies if I've done anything wrong posting this here. This is the first time I've asked anything.

Rich Scriven
Sam_M
  • I'd probably just use `?readLines` or `?scan` to get all the data in, as `read.table` expects a neat one-row-per-line layout. You will need to do some post-processing once you have your data though, to pick out the odd and even values to make your two columns. – thelatemail Jan 05 '16 at 05:24
  • Adding to the comment by @thelatemail, you can read everything into a single column, and then generate the two columns you really want by accessing odd rows for the first column and then even rows for the second one. – Tim Biegeleisen Jan 05 '16 at 05:26
  • Possible duplicate of [how to read FASTA into dataframe and extract subsequences of FASTA file in R](http://stackoverflow.com/questions/21263636/how-to-read-fasta-into-dataframe-and-extract-subsequences-of-fasta-file-in-r) –  Jan 05 '16 at 05:28
  • The function (`Biostrings::readDNAStringSet`) utilized in the accepted answer of the dupe works for your example too. –  Jan 05 '16 at 05:34
  • @Pascal I've seen that post as well. I don't doubt that Biostrings would work, I'm just trying to get more familiar with some of the builtin functions in R. – Sam_M Jan 05 '16 at 05:39
  • @Sam_M No need to get stuck with base functions when tuned functions already exist for reading a specific format. It's simply a waste of time and energy. –  Jan 05 '16 at 05:40
  • @Tim Biegeleisen How would you go about putting everything into a single column? – Sam_M Jan 05 '16 at 05:44
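
The odd/even idea from the comments above can be sketched in base R roughly as follows. This is a minimal sketch, not anyone's posted solution: it assumes the file strictly alternates one header line and one sequence line, and the names `path`, `lines`, `is_header` and `odd_even` are made up for illustration.

```r
# Write the example from the question to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c(">ID1", "AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC",
             ">ID2", "TCCAATTAAGTCCCTATCCAGGCGCTCCG",
             ">ID3", "GAACCGGAGAACGCTTCAGACCAGCCCGGAC"), path)

# Read every line into one character vector (the "single column").
lines <- readLines(path)

# TRUE for lines 1, 3, 5, ... (headers), FALSE for lines 2, 4, 6, ... (sequences).
# Assumption: headers and sequences strictly alternate.
is_header <- rep(c(TRUE, FALSE), length.out = length(lines))

odd_even <- data.frame(
  ID       = sub("^>", "", lines[is_header]),  # drop the leading ">"
  Sequence = lines[!is_header],
  stringsAsFactors = FALSE
)
odd_even
```

For a file this simple, `scan(path, what = character(), quiet = TRUE)` should return the same character vector as `readLines(path)`, so either function can feed the indexing step.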

1 Answer

It would take some tricky post-processing to do this with read.table() or readLines(). There is a function read.fasta() in the seqinr package that can get you most of the way there. Then we just turn the resulting list into a data frame.

library(seqinr)
(fasta <- read.fasta("so.fasta", set.attributes = FALSE, as.string = TRUE, forceDNAtolower = FALSE))
# $ID1
# [1] "AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC"
#
# $ID2
# [1] "TCCAATTAAGTCCCTATCCAGGCGCTCCG"
#
# $ID3
# [1] "GAACCGGAGAACGCTTCAGACCAGCCCGGAC"

setNames(rev(stack(fasta)), c("ID", "Sequence"))
#    ID                         Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2    TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3  GAACCGGAGAACGCTTCAGACCAGCCCGGAC

where the file so.fasta is

writeLines(">ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC", "so.fasta")
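
To see what the `setNames(rev(stack(fasta)), ...)` step does without installing seqinr, here is a small base-only sketch. The list `fasta_list` is a hand-built stand-in for the `read.fasta()` result shown above, and the names `stacked`, `flipped` and `fasta_df` are illustrative only.

```r
# Hand-built stand-in for the named list shown above (an assumption, no package needed).
fasta_list <- list(
  ID1 = "AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC",
  ID2 = "TCCAATTAAGTCCCTATCCAGGCGCTCCG",
  ID3 = "GAACCGGAGAACGCTTCAGACCAGCCCGGAC"
)

# stack() turns the list into a two-column data frame:
# "values" holds the sequences, "ind" holds the list names as a factor.
stacked <- stack(fasta_list)
names(stacked)

# rev() reverses the column order, so the names come first.
flipped <- rev(stacked)

# setNames() relabels the columns.
fasta_df <- setNames(flipped, c("ID", "Sequence"))

# Optional: store the IDs as plain strings instead of a factor.
fasta_df$ID <- as.character(fasta_df$ID)
fasta_df
```

Because `stack()` repeats each list name once per element, it only yields one row per ID when every list entry is a single string, which is what `as.string = TRUE` arranges in the call above.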

Note: Pascal makes a good point in the comments. When a tool already exists for your specific task, take advantage of it. There is no need to spend time forcing functions that aren't right for the job when someone has already built a tool for exactly this problem and shared it in a package.

Update: Actually, it's not that difficult using readLines(), so long as you have a nice clean file. Here is a possible solution using only base functions.

x <- readLines("so.fasta")
ids <- grepl("^>", x)
data.frame(ID = sub(">", "", x[ids]), Sequence = x[!ids])
#    ID                         Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2    TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3  GAACCGGAGAACGCTTCAGACCAGCCCGGAC
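
If the file is less clean, for example with sequences wrapped over several lines or with stray blank lines, the same idea can be extended. The following is only a sketch under stated assumptions: the first non-blank line is a header, and every header is followed by at least one sequence line. The wrapped sample file and the names `path`, `record`, `seqs` and `wrapped` are invented for illustration.

```r
# A made-up variant of so.fasta with wrapped sequences and a blank line.
path <- tempfile(fileext = ".fasta")
writeLines(c(">ID1", "AGAATAGCCAGAACC", "GTTTCTCTGAGGCTTCC",
             ">ID2", "TCCAATTAAGTCCCTATCCAGGCGCTCCG",
             "",
             ">ID3", "GAACCGGAGAACGC", "TTCAGACCAGCCCGGAC"), path)

x <- readLines(path)
x <- x[nzchar(trimws(x))]        # drop blank or whitespace-only lines

ids <- grepl("^>", x)            # TRUE on header lines
record <- cumsum(ids)            # record number for every line (1, 1, 1, 2, 2, ...)

# Paste together all sequence lines that belong to the same record.
seqs <- tapply(x[!ids], record[!ids], paste, collapse = "")

wrapped <- data.frame(
  ID       = sub("^>", "", x[ids]),
  Sequence = as.vector(seqs),    # as.vector() drops the array names
  stringsAsFactors = FALSE
)
wrapped
```

A header with no sequence lines would break the row alignment here, which is one reason a dedicated reader is the safer choice for real data.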
Rich Scriven
  • Fair enough. I had myself convinced that this must have a simple answer to it, so I thought it'd be a good exercise to learn a little more about base R functions. Your answer worked perfectly though! Thanks a ton! – Sam_M Jan 05 '16 at 06:10