2

Okay I will describe the real data instead of a reprex as I don't think it would make it any easier, but to clarify it all, this question involves a tiny biochemistry 101.

I work with DNA mutagenesis libraries, where certain DNA positions are randomised, which leads to proteins that therefore also have randomised amino acid positions. DNA consists of nucleotides (of which there are four, GATC) and an amino acid (of which there are 20, represented by letters) is encoded by a group of three nucleotides (a 'codon').

I have two vectors describing the codon - amino acid relationship:

cods <- c("GCT","GCC","GCA","GCG","CGT","CGC","CGA","CGG","AGA","AGG","AAT","AAC","GAT","GAC","TGT","TGC","CAA","CAG","GAA","GAG","GGT","GGC","GGA","GGG","CAT","CAC","TAA","TAG","TGA","ATT","ATC","ATA","CTT","CTC","CTA","CTG","TTA","TTG","AAA","AAG","ATG","TTT","TTC","CCT","CCC","CCA","CCG","TCT","TCC","TCA","TCG","AGT","AGC","ACT","ACC","ACA","ACG","TGG","TAT","TAC","GTT","GTC","GTA","GTG")
aas <- c("A","A","A","A","R","R","R","R","R","R","N","N","D","D","C","C","Q","Q","E","E","G","G","G","G","H","H","*","*","*","I","I","I","L","L","L","L","L","L","K","K","M","F","F","P","P","P","P","S","S","S","S","S","S","T","T","T","T","W","Y","Y","V","V","V","V")

The randomised positions allow certain nucleotides at a certain position in the codon and are indicated by a particular (irrelevant) letter, and so, for example the nucleotide codon "NYS" allows all four nucleotides (GATC) at the first position, but only AG at position two, and AC at position three. I now create all the possible triplets of the NYS and another library like so:

NYS <- expand.grid(list(c("A","C","G", "T"), c("C","T"), c("C","G")))
VRM <- expand.grid(list(c("A","C","G"), c("A","G"), c("A","C")))

and then I compute the corresponding amino acids of all these combinations:

# make codon triplet strings
NYS[,"cods"] <- paste(NYS$Var1, NYS$Var2, NYS$Var3, sep='')
VRM[,"cods"] <- paste(VRM$Var1, VRM$Var2, VRM$Var3, sep='')

#look them up in the aa vector and add a column
NYS[,"aas"] <- aas[match(NYS$cods, cods)]
VRM[,"aas"] <- aas[match(VRM$cods, cods)]

#get only the relevant columns
VRM <- VRM %>% select("aas", "cods")
NYS <- NYS %>% select("aas", "cods")
NYS$cods <- "NYS"
VRM$cods <- "VRM"

Now comes the tricky part: Depending on a certain input vector, describing the number and type of random codons, e.g. library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")

I now want to compute all the amino acid sequences that can be occur in these libraries. I then want to create a data frame that contains all the unique sequences and the occurrence count. I am doing it like this:

# make a string that contains n sort()s of the columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(lapply(str_split(paste(gsub("(...)", "sort(\\1\\$aas)", library_cods), collapse = ","), ",", simplify = T), function(x) eval(parse(text=x))))

# get the string from the rows
unique_seqs <- apply(all_combos, 1, function(x) paste(x, collapse = ""))

#rle() to count
unique_seqs <- data.frame(unclass(rle(sort(unique_seqs))))

#sort by count
unique_seqs <- unique_seqs[order(unique_seqs$lengths, decreasing = T),]

This works all fine, however, the issue is that it is really slow. So my main question is, how could I make this faster? On my system it takes 70s to execute the two lines doing rle() and the one after. This compares to sort -n file | uniq -c | sort -n in bash taking ~22s on the same data. Although that's better, it's still really slow, so I'm thinking maybe I should just do some maths instead of computing and counting ^^

As a side question; I also feel that my code is pretty clumsy (in partiuclar the all_combos <- line; I know it's really bad to evaluate a string as code); if anybody would want to point out how any of my code could be made more efficient, I would be grateful as well.

Anonymous
  • 183
  • 1
  • 9
  • you should take a look into the Biostrings library ! https://bioconductor.org/packages/release/bioc/html/Biostrings.html – user12256545 Feb 18 '21 at 15:57
  • 1
    You don't need to use `rle` here - all you are doing is counting the number of each value, so just use `sort(table(unique_seqs), decreasing = TRUE)` instead of your final two lines. – Andrew Gustar Feb 18 '21 at 16:02
  • 1
    And a more elegant way of doing your `all_combos` line would be to define a named list `lib_list <- list(NYS = NYS$aas, VRM = VRM$aas)`, and then `all_combos <- expand.grid(lib_list[library_cods])` – Andrew Gustar Feb 18 '21 at 16:10

1 Answers1

2

Some steps of your code can be made more concise. For the triplets just the vectors are needed, we fetch them later using mget.

NYS <- expand.grid(list(c("A", "C", "G", "T"), c("C", "T"), c("C", "G")))
VRM <- expand.grid(list(c("A", "C", "G"), c("A", "G"), c("A", "C")))

## triplets
NYS <- aas[match(Reduce(paste0, NYS), cods)]
VRM <- aas[match(Reduce(paste0, VRM), cods)]

## input vector
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")

# columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(mget(library_cods))

# get the string from the rows
unique_seqs <- Reduce(paste0, all_combos)

# sort by count
unique_seqs <- data.frame(sort(table(unique_seqs), decreasing=T))

Result

head(unique_seqs)
#   unique_seqs Freq
# 1      LRLLRR  729
# 2      ARLLRR  486
# 3      LGLLRR  486
# 4      LRALRR  486
# 5      LRLARR  486
# 6      LRLLGR  486

Runs around 16 seconds on my system which is reasonable.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    @Max Except from the code shown, I just used your `cods` and `aas`. Try again in a fresh R session. – jay.sf Feb 18 '21 at 20:20
  • 1
    Works. Will delete my previous comment. Seems like I can even omit the `lapply( , sort)` step, as it will be sorted afterwards anyway. Thanks for this, runs very fast. I had to look up `mget()` and `Reduce()`. The former is immediately clear, but I found that your `Reduce(paste0, all_combos)` is (on my system) **ten times** faster than my `apply(all_combos, 1, function(x) paste0(x, collapse = ""))`. Can you explain why? – Anonymous Feb 19 '21 at 09:12
  • 1
    @Max `Reduce` internally uses a binary function to successively combine the characters, thus is vectorized, while `apply` loops through the lines, making a copy at each iteration. You're right, the `sort` is superfluous, edited. – jay.sf Feb 19 '21 at 12:37