Imagine I have 30 sequences, each some combination of c("A", "G", "T"), which are not all the same length. I'd like to find how often A occurs in position 1, then in position 2, and so on up to the nth position (and likewise for the other letters).
E.g. here are 3 sequences containing A, G and T of different lengths, labelled with an ID from 1 to 3. The sequences won't rbind cleanly because their lengths differ (rbind recycles the shorter vectors with a warning), so they are stored here in long format, one letter per row.
# c() concatenates the three sequences; rep() labels each letter with its sequence ID
df <- data.frame(Sequence = c(sample(c("A", "G", "T"), size = 10, replace = TRUE),
                              sample(c("A", "G", "T"), size = 15, replace = TRUE),
                              sample(c("A", "G", "T"), size = 4, replace = TRUE)),
                 ID = rep(1:3, c(10, 15, 4)))
The following returns the first 4 values of each sequence in wide format. I can count each A, G and T in each column, but I'm stuck after that because some of the sequences are longer than 4.
tmp <- aggregate(Sequence ~ ID, data = df, FUN = function(x) head(x, 4))
Any help would be much appreciated, e.g. using dplyr.
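To make the target concrete, here is a minimal base R sketch of the kind of result I'm after. It assumes long-format data with one letter per row; the toy data and the names `pos`, `counts` and `freqs` are made up for illustration and are not part of my real data.

```r
# Hypothetical toy data in long format: one row per letter, grouped by ID
df <- data.frame(
  ID = rep(1:3, c(4, 2, 3)),
  Sequence = c("A", "G", "T", "A",   # sequence 1 (length 4)
               "A", "T",             # sequence 2 (length 2)
               "G", "G", "A")        # sequence 3 (length 3)
)

# Position of each letter within its own sequence (1, 2, ..., length)
df$pos <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

# Count each letter at each position; shorter sequences simply
# contribute nothing to the later positions
counts <- table(factor(df$Sequence, levels = c("A", "G", "T")), df$pos)

# Column-wise proportions: share of each letter among the sequences
# that are long enough to reach that position
freqs <- prop.table(counts, margin = 2)

print(counts)
print(freqs)
```

The rows of `counts` are the letters and the columns are the positions, so no sequence has to be truncated to a common length. With dplyr, I assume something similar could be built from `group_by(ID)`, `mutate(pos = row_number())` and `count(pos, Sequence)`.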
EDIT: Including dput of the data frame df.
dput(df)
structure(list(ActivityID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("01",
"02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30"), class = "factor"),
nucl = c("A", "A", "G", "G", "G", "G", "G", "G", "G", "G",
"G", "G", "G", "G", "G", "G", "T", "G", "T", "G", "G", "G",
"G", "G", "A", "A", "A", "A", "A", "A", "G", "G", "T", "G",
"G", "G", "G", "G", "A", "G", "G", "T", "G", "G", "T", "A",
"A", "G", "G", "T")), row.names = c(NA, 50L), class = "data.frame")