R: Calculate frequencies of a A, G, T at the nth position over multiple sequence

Question

Imagine I have 30 sequences of some combinations of c("A", "G", "T") which are not all the same length. I'd like to find the frequency of how often A was in position 1, then position 2, up to the nth position (and repeat for all other letters).

E.g. here are 3 sequences containing A, G and T of different lengths labelled with an ID from 1 to 3. I apologise beforehand that I cannot work out why these sequences won't rbind.

df<-data.frame(Sequences=rbind(sample(c("A","G","T"), size = 10, replace = TRUE),
                              sample(c("A","G","T"), size = 15, replace = TRUE),
                              sample(c("A","G","T"), size = 4, replace = TRUE)),
              ID=rbind(rep(1:3,c(10,15,4))))

This returns the first 4 values in wide format. I can count each A, G and T in each column but I'm a bit stuck after that because some of sequences are longer than 4.

tmp<-aggregate(data=df,Sequence~ID,function(x)head(x,4))

Any help will be much appreciated eg using dplyr?

EDIT: Including dput of the data frame df.

dput(df)
structure(list(ActivityID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("01", 
"02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", 
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", 
"24", "25", "26", "27", "28", "29", "30"), class = "factor"), 
    nucl = c("A", "A", "G", "G", "G", "G", "G", "G", "G", "G", 
    "G", "G", "G", "G", "G", "G", "T", "G", "T", "G", "G", "G", 
    "G", "G", "A", "A", "A", "A", "A", "A", "G", "G", "T", "G", 
    "G", "G", "G", "G", "A", "G", "G", "T", "G", "G", "T", "A", 
    "A", "G", "G", "T")), row.names = c(NA, 50L), class = "data.frame")

@user2974951 They are different lengths but can I not stack them on top of each other? — HCAI, Aug 21 '19 at 13:37
@HCAI Yes but not like that, also my version in my answer is a better option, easier to compare. — user2974951, Aug 21 '19 at 13:42

score 2 · Answer 1 · answered Aug 21 '19 at 11:56

I slightly changed your code since it was wrong, here is my result

> df<-data.frame(cbind(c(sample(c("A","G","T"), size = 10, replace = TRUE), rep(NA,5)),
                       sample(c("A","G","T"), size = 15, replace = TRUE),
                       c(sample(c("A","G","T"), size = 4, replace = TRUE), rep(NA,11))))
> apply(df,1,function(x){mean(x=="A",na.rm=T)})
 [1] 0.3333333 0.3333333 0.0000000 1.0000000 0.0000000 0.5000000 0.5000000
 [8] 0.0000000 1.0000000 0.5000000 0.0000000 1.0000000 1.0000000 1.0000000
[15] 0.0000000

Which returns proportions, if you want frequencies use sum instead.

Arienrhod · Accepted Answer · 2019-08-21T15:08:26.477

1

If you want to keep your sequences as rows as in your proposed input, you can do the following using dplyr and purrr functions:

nucl <- c("A","G","T")
df <- data.frame(rbind(c(sample(nucl, size = 10, replace = TRUE), rep(NA,5)),
                       sample(nucl, size = 15, replace = TRUE),
                       c(sample(nucl, size = 4, replace = TRUE), rep(NA,11))))
out <- nucl %>% 
    map_df(function(x) summarise_all(df, ~mean(. == x, na.rm=TRUE)), .id="nucl_id") %>% 
    mutate(nucl_id = nucl[as.numeric(nucl_id)])

This will produce a data frame where the first columns informs you of the nucleotide in question, while the other columns give you a proportion of the nucleotide in each position. You can also get the whole thing as a list of data frames by using:

out <- nucl %>% 
    map(function(x) summarise_all(df, ~mean(. == x, na.rm=TRUE))) %>% 
    set_names(nucl)

EDIT: Based on your data input, you can first spread your data to the wide format based on the ActivityID:

df_wide <- df %>%
    group_by(ActivityID) %>% 
    mutate(position = paste0("pos", formatC(seq(1:n()), width=2, flag="0"))) %>% 
    spread(position, nucl) %>% 
    ungroup()

And then get the proportions per each position.

out <- nucl %>% 
    map_df(function(x) summarise_all(select(df_wide, -ActivityID), ~mean(. == x, na.rm=TRUE)), .id="nucl_id") %>% 
    mutate(nucl_id = nucl[as.numeric(nucl_id)])

You have to decide for yourself whether you want to keep na.rm=TRUE or not, because in cases of longer sequences, it will seem like all of them have a certain letter there.

edited Aug 21 '19 at 15:08

answered Aug 21 '19 at 12:47

Arienrhod

2,451
1
11
19

Thank you very much for this. My data is in long format with one sequence stacked on top of the next with a column of ID to indicate which sequence each row element belongs to. That's why I can't work out why rbind won't stack columns of unequal lengths. – HCAI Aug 21 '19 at 13:37
It will, but only if you fill the rest of each sequence with NAs (like in the toy example above). What's the exact format of your input? Can you give us an example with `dput`? – Arienrhod Aug 21 '19 at 13:47
Thank you Arienrhod, I have included a simple dput of df. Does it make better sense of the data format now? I apologise for the confusion. – HCAI Aug 21 '19 at 14:05
I've added an edit more tailored to your data structure. – Arienrhod Aug 21 '19 at 15:13
Thank you very much indeed, this works very nicely! By the way, what's the deal with the rbind. Surely it should stack the vectors regardless of their length. Am I missing something simple? – HCAI Aug 21 '19 at 20:54
1

no, `rbind` fails with vectors of different lengths, but `bind_rows` should do the job and add missing values as discussed [here](https://stackoverflow.com/q/42887217/11764521). – Arienrhod Aug 21 '19 at 22:37

R: Calculate frequencies of a A, G, T at the nth position over multiple sequence

EDIT: Including dput of the data frame df.

2 Answers2