
I'm working on a large dataset at the moment, and so far I could solve all my ideas/problems very well via countless Google searches and long trial & error sessions. I've managed to use plyr and reshape functions for some transformations of my different datasets and learned a lot, but I think I've reached a point where my present R knowledge won't help me anymore.

Even if my question sounds very specific (i.e. OTU table and fasta file), I guess my attempt is a common R application across many different fields (not just bioinformatics).

Right now, I have merged a reference sequence file with an abundance table, and I would like to generate a specific file based on the information in this data.frame: a fasta file.

My df looks a bit like this at the moment:

repSeq     sw.1.102 sw.3.1021 sw.30.101 sw.5.1042 ...
ACCT-AGGA  3        0         1         0
ACCT-AGGG  1        1         2         0
ACTT-AGGG  0        1         0         25
...

The resulting file should look like this:

>sw.1.102_1
ACCT-AGGA
>sw.1.102_2
ACCT-AGGA
>sw.1.102_3
ACCT-AGGA
>sw.1.102_4
ACCT-AGGG
>sw.3.1021_1
ACCT-AGGG
>sw.3.1021_2
ACTT-AGGG
>sw.30.101_1
ACCT-AGGA
>sw.30.101_2
ACCT-AGGG
...

As you can see I would like to use the information about the number of (reference) sequences for each sample (i.e. sw.n) to create a (fasta) file.

I have no experience with loops in R (I have used basic loops only during simple processing attempts), but I assume a loop could do the trick here. I have found the write.fasta function from the SeqinR package, but I could not find any solution there. The deunique.seqs command in mothur won't work, because it needs a fasta file as input (which I obviously don't have). It is very possible that there is something on Bioconductor (OTUbase?), but to be honest, I don't know where to begin, and I'm glad about any help. I really would like to do this in R, since I enjoy working with it, but any other ideas are also very welcome.
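For future readers: the expansion itself does not actually need an explicit loop, since `rep()` can replicate each sequence by its count. A minimal sketch, assuming a hypothetical two-sample subset of the table above with `repSeq` as a regular column (column names and file name are placeholders):

```r
# Minimal sketch: expand a count table into fasta-style id/sequence lines.
# The data.frame below is a hypothetical two-sample subset of the table above.
df <- data.frame(
  repSeq    = c("ACCT-AGGA", "ACCT-AGGG", "ACTT-AGGG"),
  sw.1.102  = c(3, 1, 0),
  sw.3.1021 = c(0, 1, 1),
  stringsAsFactors = FALSE
)

lines <- unlist(lapply(names(df)[-1], function(smp) {
  seqs <- rep(df$repSeq, df[[smp]])                 # one copy per observed count
  ids  <- sprintf(">%s_%d", smp, seq_along(seqs))   # number within each sample
  rbind(ids, seqs)                                  # interleave: id, seq, id, seq, ...
}), use.names = FALSE)

writeLines(lines, "expanded.fasta")
```

`rbind()` stacks ids over sequences, and flattening that 2-row matrix column-major gives the alternating id/sequence order that a fasta file needs.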

//small edit:

Both answers below work very well (see my comments). I also found two possible not-so-elegant, non-R workarounds (not tested yet):

  • since I already have a taxonomy file and an abundance OTU table, I think the mothur command make.biom could be used to create a biom-format file. I haven't worked with biom files yet, but I think there are tools and scripts available to save the biom-file data as fasta again
  • convert Qiime files to the oligotyping format, which also needs a taxonomy file and an OTU table

Not sure if both ways work, so please correct me if I'm wrong.

2 Answers


Here's your data, coerced to a matrix (which is a more natural representation for rectangular data of homogeneous type).

df <- read.delim(textConnection(
    "repSeq     sw.1.102 sw.3.1021 sw.30.101 sw.5.1042
     ACCT-AGGA  3        0         1         0
     ACCT-AGGG  1        1         2         0
     ACTT-AGGG  0        1         0         25"
    ), sep="", row.names=1)
m <- as.matrix(df)

The tricky part is figuring out how to number the duplicated column name entries. I did this by creating sequences of the appropriate length and un-listing. I then created a matrix with two rows: the first (from replicating the colnames() as often as required by the counts in the original matrix) is the id, and the second the sequence.

csum <- colSums(m)
idx <- unlist(lapply(csum, seq_len), use.names=FALSE)
res <- matrix(c(sprintf(">%s_%d", rep(colnames(m), csum), idx), # id
                rep(rownames(m)[row(m)], m)),                   # sequence
              nrow=2, byrow=TRUE)

Use writeLines(res, "your.fasta") to write out the results, or setNames(res[2,], res[1,]) to get a named vector of sequences.
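For example, here is a self-contained end-to-end run of the snippets above on the example table (the file name your.fasta is just a placeholder):

```r
# Rerun the approach above on the example data and inspect the result.
df <- read.delim(textConnection(
    "repSeq     sw.1.102 sw.3.1021 sw.30.101 sw.5.1042
     ACCT-AGGA  3        0         1         0
     ACCT-AGGG  1        1         2         0
     ACTT-AGGG  0        1         0         25"
    ), sep = "", row.names = 1)
m <- as.matrix(df)

csum <- colSums(m)
idx  <- unlist(lapply(csum, seq_len), use.names = FALSE)
res  <- matrix(c(sprintf(">%s_%d", rep(colnames(m), csum), idx),  # ids
                 rep(rownames(m)[row(m)], m)),                    # sequences
               nrow = 2, byrow = TRUE)

writeLines(res, "your.fasta")          # alternating id / sequence lines
named <- setNames(res[2, ], res[1, ])  # named vector of sequences
```

Writing the 2-row matrix with writeLines() flattens it column-major, so each id is immediately followed by its sequence.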

Martin Morgan
  • Hi Martin, thank you for your answer and help. This also works very well (with some small trickery). I used sink() to save the writeLines output, and the resulting fasta file was quite large: the first >10000 lines were >X_n fasta names, followed by the corresponding sequences. After deleting the upper part of the file, the rest of the file was almost the same as above: different numbering but the same amount/length. Hence it works! I don't know yet if this is caused by the sink() function or by my input file, but I will see if I can get this working without additional textwrangling ;) Cheers – plik Mar 04 '15 at 09:32
  • @plik the second argument to `writeLines()` is the name of the file to write to; no need for `sink()`. Likely you concatenated results to the same file. – Martin Morgan Mar 04 '15 at 13:29
  • I still couldn't figure out why the final file has the above-mentioned doublets. I think it's somehow due to the df-to-matrix conversion I performed. Anyway, I would like to thank you again, because while both methods work very well, your specific numbering turned out to be exactly what I needed for the subsequent oligotyping analysis I had in mind. Great :) – plik Mar 29 '15 at 17:17

Try this; it goes through the data frame line by line and concatenates repetitions of sequences:

library(dplyr)  # assumed: lag() here is dplyr's vector lag, not base stats::lag

fasta_seq <- apply(df, 1, function(x) {
        p <- x[1]
        paste(unlist(mapply(function(x, y, z) {
                if (as.numeric(y) > 0) {
                        paste(">", x, "_", (z + 1):(z + y), "\n", p, "\n", sep = "")
                }
        }, colnames(df)[-1], as.numeric(x[-1]),
           c(0, lag(cumsum(as.numeric(x[-1])))[-1]), USE.NAMES = FALSE)),
        collapse = "")
})

write(paste(fasta_seq,collapse=""),"your_file.txt")
NicE
  • Great, this worked perfectly! Thank you so much for your help and work. It's late now, but tomorrow I will spend some time understanding what you did. I guess it's really time to learn how "function(x)" works. Until today I could easily bypass it with non-elegant approaches and/or TextWrangler. Cheers! – plik Mar 03 '15 at 22:46