I am trying to use dcast to reshape nucleotide counts from long format to wide format, as shown below:
res <- read.table(text='seqnames pos strand nucleotide count which_label V3 REF
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000086465 T
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000169927 T
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000038191 T
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000086465 T
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000169927 T
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000038191 T', header=TRUE)
> res
seqnames pos strand nucleotide count which_label V3 REF
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000086465 TRUE
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000169927 TRUE
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000038191 TRUE
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000086465 TRUE
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000169927 TRUE
1 134199222 - A NA 1:134199222-134199222 ENSMUST00000038191 TRUE
# change the levels so that even if there is no information, we get an output
res$strand <- factor(res$strand,levels=c('-','+'))
res$nucleotide <- factor(res$nucleotide,levels=c('A','T','G','C'))
res$seqnames <- factor(res$seqnames, levels=unique(res$seqnames))
# convert NAs to 0
# do not drop any missing rows
# get results for all possible nucleotide and strand even if absent
results <- dcast(res, seqnames+pos+V3~nucleotide+strand,
value.var = "count", fill = 0, drop=FALSE)
*Aggregation function missing: defaulting to length*
# results object looks like this
seqnames pos V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
1 134199222 ENSMUST00000038191 2 0 0 0 0 0 0 0
1 134199222 ENSMUST00000086465 2 0 0 0 0 0 0 0
1 134199222 ENSMUST00000169927 2 0 0 0 0 0 0 0
As you can see, dcast falls back to length as the aggregation function and puts 2 in A_- (the number of matching rows), whereas I want 0 because every count in the data frame is NA. I expect something like this:
seqnames pos V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
1 134199222 ENSMUST00000038191 0 0 0 0 0 0 0 0
1 134199222 ENSMUST00000086465 0 0 0 0 0 0 0 0
1 134199222 ENSMUST00000169927 0 0 0 0 0 0 0 0
Even though I am passing value.var = "count", why is it still aggregating by length? Any help would be much appreciated!
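For context, dcast falls back to length whenever more than one row maps to the same output cell and no aggregation function is given, regardless of value.var; here each V3 appears twice for the same nucleotide and strand. A minimal sketch of one possible workaround follows, assuming the dcast in use is reshape2::dcast and that summing the counts while ignoring NAs is the intended aggregation. The helper name sum_ignoring_na is hypothetical, and the data frame is rebuilt by hand (without the which_label and REF columns) only to keep the example self-contained.

```r
library(reshape2)  # assumption: dcast comes from reshape2

# Rebuild the example data: each transcript appears twice, all counts are NA
res <- data.frame(
  seqnames   = 1,
  pos        = 134199222,
  strand     = "-",
  nucleotide = "A",
  count      = NA_real_,
  V3         = rep(c("ENSMUST00000086465",
                     "ENSMUST00000169927",
                     "ENSMUST00000038191"), 2),
  stringsAsFactors = FALSE
)

# Same factor levels as in the question, so absent combinations still get columns
res$strand     <- factor(res$strand, levels = c("-", "+"))
res$nucleotide <- factor(res$nucleotide, levels = c("A", "T", "G", "C"))
res$seqnames   <- factor(res$seqnames, levels = unique(res$seqnames))

# Hypothetical helper: sum the counts in a cell, treating NA as "no reads"
sum_ignoring_na <- function(x) sum(as.numeric(x), na.rm = TRUE)

# Supplying fun.aggregate explicitly replaces the default of length
results <- dcast(res, seqnames + pos + V3 ~ nucleotide + strand,
                 value.var = "count", fun.aggregate = sum_ignoring_na,
                 fill = 0, drop = FALSE)
print(results)
```

With an explicit fun.aggregate, the cells hold the aggregated count values rather than the number of rows, so an all-NA group should come out as 0 instead of 2. If a different summary is wanted (for example the mean), only the helper needs to change.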
Thanks!