Split the problem into sub-problems, solve them individually, and compose the solution.
The first subproblem is: how do I get codon frequencies of a given (in-frame) sequence? The answer is either to use a pre-made solution (e.g. Bioconductor’s Biostrings:: trinucleotideFrequency(…, steps = 3L)
), or something quick and dirty like the following:
codon_frequencies = function (seq) {
# Take care of incomplete codon at end.
len = nchar(seq) - (nchar(seq) %% 3L)
start = seq(1L, len, by = 3L)
substring(seq, start, start + 2L) |> table()
}
Try it:
codon_frequencies('TGGAGCCTCGCTC')
#
# AGC CTC GCT TGG
# 1 1 1 1
… incidentally, is it intentional that your sequences have fragmentary codons? If so, are you sure they always start on a full codon?
OK. The next step is calling this function for each gene ID in your table, and collecting the results. At this point, we’re helped by the fact that a counts table can be converted to a tidy data frame:
data.frame(codon_frequencies('TGGAGCCTCGCTC'))
# Var1 Freq
# 1 AGC 1
# 2 CTC 1
# 3 GCT 1
# 4 TGG 1
For our purposes, this is a convenient format, because it makes table manipulation easier (especially when working in tidy data format, which I’m doing in the following using the packages ‘dplyr’, ‘tidyr’ and ‘purrr’):
df |>
group_by(`Gene ID`) |>
summarize(map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x))))
# # A tibble: 20 × 3
# # Groups: Gene ID [1]
# `Gene ID` Var1 Freq
# <chr> <fct> <int>
# 1 ENSG00000003989 AGC 1
# 2 ENSG00000003989 CTC 1
# 3 ENSG00000003989 GCT 1
# 4 ENSG00000003989 TGG 1
# …
At this point we could probably call it a day: this is a convenient format to work with. However, if you prefer, you can also pivot the data into wide format:
… |>
pivot_wider(
id_cols = `Gene ID`,
names_from = Var1,
values_from = Freq,
values_fill = 0L # Otherwise missing codons will be `NA`
)
# # A tibble: 5 × 11
# # Groups: Gene ID [5]
# `Gene ID` AGC CTC GCA TGG TGA GCG AGT GAT TAC GCT
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ENSG00000003981 1 1 1 1 0 0 0 0 0 0
# 2 ENSG00000003982 1 1 0 1 1 0 0 0 0 0
# 3 ENSG00000003983 1 1 0 1 0 1 0 0 0 0
# 4 ENSG00000003984 0 0 0 0 0 0 1 1 1 0
# 5 ENSG00000003989 1 1 0 1 0 0 0 0 0 1
(This is using some different toy data.)
Finally, if you want to have columns for all codons, even those not present in your data, you can make a small modification to the codon_frequencies
function:
all_codons = c('A', 'C', 'G', 'T') %>% expand.grid(., ., .) |> apply(1L, paste, collapse = '')
codon_frequencies = function (seq, all = FALSE) {
# Take care of incomplete codon at end.
len = nchar(seq) - (nchar(seq) %% 3L)
start = seq(1L, len, by = 3L)
codons = substring(seq, start, start + 2L)
table(if (all) factor(codons, levels = all_codons) else codons)
}
And then call it as codon_frequencies(.x, all = TRUE)
in the code above. The pivot_wider
no longer needs the values_fill = 0L
argument then.
Putting it all together:
df |>
group_by(`Gene ID`) |>
summarize(
map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x, all = TRUE))),
.groups = 'drop'
) |>
pivot_wider(
id_cols = `Gene ID`,
names_from = Var1,
values_from = Freq
)