
I am trying to use dcast to reshape nucleotide frequencies from long format to wide format, as shown below:

res <- read.table(text='seqnames    pos strand  nucleotide  count   which_label V3  REF
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000086465  T
                  1 134199222   -   A   NA  1:134199222-134199222   ENSMUST00000169927  T
                  1 134199222   -   A   NA  1:134199222-134199222   ENSMUST00000038191  T
                  1 134199222   -   A   NA  1:134199222-134199222   ENSMUST00000086465  T
                  1 134199222   -   A   NA  1:134199222-134199222   ENSMUST00000169927  T
                  1 134199222   -   A   NA  1:134199222-134199222   ENSMUST00000038191  T',header=TRUE)

> res
seqnames       pos strand nucleotide count           which_label                 V3  REF
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000086465 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000169927 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000038191 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000086465 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000169927 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000038191 TRUE

# change the levels so that even if there is no information, we get an output
res$strand <- factor(res$strand,levels=c('-','+'))
res$nucleotide <- factor(res$nucleotide,levels=c('A','T','G','C'))
res$seqnames <- factor(res$seqnames, levels=unique(res$seqnames))

# convert NAs to 0
# do not drop any missing rows
# get results for all possible nucleotide and strand even if absent
results <- dcast(res, seqnames+pos+V3~nucleotide+strand,
                 value.var = "count", fill = 0, drop=FALSE)

*Aggregation function missing: defaulting to length*

# results object looks like this

seqnames       pos                 V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
       1 134199222 ENSMUST00000038191   2   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000086465   2   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000169927   2   0   0   0   0   0   0   0

As you can see, dcast defaults to length and outputs 2 in A_-, whereas I want 0 because count is NA for those rows. I expect something like this:

seqnames       pos                 V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
       1 134199222 ENSMUST00000038191   0   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000086465   0   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000169927   0   0   0   0   0   0   0   0

Even though I am using value.var = "count", why is it still aggregating by length? Any help would be much appreciated!

Thanks!

Komal Rathi
  • It could be also a dupe of [this](http://stackoverflow.com/questions/12831524/can-dcast-be-used-without-an-aggregate-function) – akrun Feb 11 '16 at 17:48
  • I did look at the questions before posting this. I am using value.var because I just want the value to be present and no function to be applied but it is still aggregating by length. Also, I do have a reproducible example showing what my problem is. – Komal Rathi Feb 11 '16 at 17:50
  • The `read.delim` line is reading as a single column – akrun Feb 11 '16 at 17:52
  • Fixed. It should've been `read.table` – Komal Rathi Feb 11 '16 at 17:53
  • Actually, there are 2 elements for `A -` based on the combinations of other groups. Also, if this should be all 0's, I don't understand why this should be done. – akrun Feb 11 '16 at 17:57
  • Gotcha, duplicate rows. You can close this question as duplicate. Sorry!! – Komal Rathi Feb 11 '16 at 18:02
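
As the comments note, each seqnames/pos/V3 combination appears twice for `A -`, so dcast has to aggregate and, with no fun.aggregate supplied, falls back to length. Below is a minimal sketch of one possible way to get zeros instead. It assumes dcast comes from the reshape2 package, rebuilds a trimmed copy of the example data (the which_label and REF columns are left out), and assumes that summing count while ignoring NA is the desired behaviour for duplicated rows:

```r
library(reshape2)  # assumption: dcast() is reshape2::dcast

# Trimmed copy of the example data: three transcripts, each listed twice
res <- data.frame(
  seqnames   = 1,
  pos        = 134199222,
  strand     = factor("-", levels = c("-", "+")),
  nucleotide = factor("A", levels = c("A", "T", "G", "C")),
  count      = NA_real_,  # numeric NA so that sums are numeric as well
  V3         = rep(c("ENSMUST00000086465",
                     "ENSMUST00000169927",
                     "ENSMUST00000038191"), 2),
  stringsAsFactors = FALSE
)

# Count rows that repeat an earlier key combination; any duplicate
# forces dcast to aggregate
key_cols <- c("seqnames", "pos", "V3", "nucleotide", "strand")
n_dup <- sum(duplicated(res[, key_cols]))

# Supply an explicit aggregation: sum the counts and ignore NA, so a
# cell whose values are all NA becomes 0 rather than a row count
results <- dcast(res, seqnames + pos + V3 ~ nucleotide + strand,
                 value.var = "count",
                 fun.aggregate = function(x) sum(x, na.rm = TRUE),
                 fill = 0, drop = FALSE)
```

If the repeated rows are true duplicates rather than separate observations, calling `unique(res)` before reshaping is an alternative worth considering, since it removes the need to aggregate in the first place.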
