Okay I will describe the real data instead of a reprex as I don't think it would make it any easier, but to clarify it all, this question involves a tiny biochemistry 101.
I work with DNA mutagenesis libraries, where certain DNA positions are randomised, which leads to proteins that therefore also have randomised amino acid positions. DNA consists of nucleotides (of which there are four, GATC) and an amino acid (of which there are 20, represented by letters) is encoded by a group of three nucleotides (a 'codon').
I have two vectors describing the codon - amino acid relationship:
cods <- c("GCT","GCC","GCA","GCG","CGT","CGC","CGA","CGG","AGA","AGG","AAT","AAC","GAT","GAC","TGT","TGC","CAA","CAG","GAA","GAG","GGT","GGC","GGA","GGG","CAT","CAC","TAA","TAG","TGA","ATT","ATC","ATA","CTT","CTC","CTA","CTG","TTA","TTG","AAA","AAG","ATG","TTT","TTC","CCT","CCC","CCA","CCG","TCT","TCC","TCA","TCG","AGT","AGC","ACT","ACC","ACA","ACG","TGG","TAT","TAC","GTT","GTC","GTA","GTG")
aas <- c("A","A","A","A","R","R","R","R","R","R","N","N","D","D","C","C","Q","Q","E","E","G","G","G","G","H","H","*","*","*","I","I","I","L","L","L","L","L","L","K","K","M","F","F","P","P","P","P","S","S","S","S","S","S","T","T","T","T","W","Y","Y","V","V","V","V")
The randomised positions allow certain nucleotides at a certain position in the codon and are indicated by a particular (irrelevant) letter, and so, for example the nucleotide codon "NYS" allows all four nucleotides (GATC) at the first position, but only AG at position two, and AC at position three. I now create all the possible triplets of the NYS and another library like so:
NYS <- expand.grid(list(c("A","C","G", "T"), c("C","T"), c("C","G")))
VRM <- expand.grid(list(c("A","C","G"), c("A","G"), c("A","C")))
and then I compute the corresponding amino acids of all these combinations:
# make codon triplet strings
NYS[,"cods"] <- paste(NYS$Var1, NYS$Var2, NYS$Var3, sep='')
VRM[,"cods"] <- paste(VRM$Var1, VRM$Var2, VRM$Var3, sep='')
#look them up in the aa vector and add a column
NYS[,"aas"] <- aas[match(NYS$cods, cods)]
VRM[,"aas"] <- aas[match(VRM$cods, cods)]
#get only the relevant columns
VRM <- VRM %>% select("aas", "cods")
NYS <- NYS %>% select("aas", "cods")
NYS$cods <- "NYS"
VRM$cods <- "VRM"
Now comes the tricky part: Depending on a certain input vector, describing the number and type of random codons, e.g.
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")
I now want to compute all the amino acid sequences that can be occur in these libraries. I then want to create a data frame that contains all the unique sequences and the occurrence count. I am doing it like this:
# make a string that contains n sort()s of the columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(lapply(str_split(paste(gsub("(...)", "sort(\\1\\$aas)", library_cods), collapse = ","), ",", simplify = T), function(x) eval(parse(text=x))))
# get the string from the rows
unique_seqs <- apply(all_combos, 1, function(x) paste(x, collapse = ""))
#rle() to count
unique_seqs <- data.frame(unclass(rle(sort(unique_seqs))))
#sort by count
unique_seqs <- unique_seqs[order(unique_seqs$lengths, decreasing = T),]
This works all fine, however, the issue is that it is really slow. So my main question is, how could I make this faster?
On my system it takes 70s to execute the two lines doing rle() and the one after. This compares to sort -n file | uniq -c | sort -n
in bash taking ~22s on the same data. Although that's better, it's still really slow, so I'm thinking maybe I should just do some maths instead of computing and counting ^^
As a side question; I also feel that my code is pretty clumsy (in partiuclar the all_combos <-
line; I know it's really bad to evaluate a string as code); if anybody would want to point out how any of my code could be made more efficient, I would be grateful as well.