1

I have 9 FASTA files, representing the DNA sequencing of 9 genes.

Each FASTA file contains 121 sequences ,representing 121 strains. The name for each sequence is the id for each strain.

However, in each file, the id is not sorted, for example, in gene1.fasta:

>1
AAA
>16
TTT
>2
GGG
...

In gene2.fasta:

>2
CCC
>34
AAA
>1
GGG
...

I want to change these 9 genes FASTA files into 121 strains FASTA files, in each file, simply combine 9 genes for one strain. For example, in strain1.fasta:

AAAGGG

in strain2.fasta:

GGGCCC

How can I do this in R?

Artem
  • 3,304
  • 3
  • 18
  • 41
Dong Yuan
  • 23
  • 7
  • Please add example of wanted result. Example of input would be helpful too: [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – pogibas Sep 29 '18 at 12:28
  • Thanks, I have edited the question. – Dong Yuan Sep 29 '18 at 13:12

1 Answers1

0

Here is a solution in R as requested, using the Biostrings package to read fasta files.

It works, but I have to say this is some of the ugliest code I have written in a long time. I just wanted to see if I could get this somehow done - this is 100% not the best solution.

library("Biostrings")
library("tidyverse")

convertStringSet = function(seq){
  return(df = data.frame("names" = names(seq), "seq" = paste(seq)))
}

# change the path accordingly
filenames = list.files("/home/x/y/z", pattern="gene*", full.names=TRUE)%>%
  lapply(readDNAStringSet)

fastaDF = filenames %>% lapply(convertStringSet) %>% 
  reduce(full_join, by = "names") %>% 
  unite("seq", -1,  sep="")

writeOutput = function(x){

  header = paste(">",x[1],sep="")
  fileName = paste("strain",x[1],".fasta",sep="")

  writeLines(c(header,x[2]), fileName)
}

apply(fastaDF, 1, writeOutput)

As an alternative, if you are on a unix system, this awk line should give the same results:

awk '$0 ~ /^>/ {i=substr($0,2); next;} i != -1 {printf "%s", $0 >> "file"i; i=-1;}' gene*
voiDnyx
  • 975
  • 1
  • 11
  • 24