My starting table looks like this:

   CHROM     POS                       REF                                 ALT  GT
1:     1   58211                         A                                   G 1/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2
3:    12   83011                         T                                   C 0/1
4:    18 1541042                         C                                 T,A 1/2

I want to apply a function "ap2" that splits the long REF and ALT entries in row 2 into two shorter entries. It should update row 2 (changing REF, ALT and GT) and insert a new row 3 (with new POS, ALT and GT). The result should look like this:

   CHROM     POS                       REF                                 ALT  GT
1:     1   58211                         A                                   G 1/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT                                   C 1/2
3:     1 6464791                         T                           TAAATAAAT 1/2
4:    12   83011                         T                                   C 0/1
5:    18 1541042                         C                                 T,A 1/2

If I run the ap2 function it displays the expected results (columns V1-V4):

tmp[,ap2(POS,REF,ALT,GT), by=c("CHROM","POS","REF","ALT","GT")]
   CHROM     POS                       REF                                 ALT  GT      V1                        V2        V3  V4
1:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2 6464767 CAAATAAATAAATAAATAAATAAAT         C 0/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2 6464791                         T TAAATAAAT 0/1
3:    18 1541042                         C                                 T,A 1/2 1541042                         C       T,A 1/2

However, if I try to update the original columns in place with `:=`, I get warnings and the extra row is discarded:

tmp[, c("POS","REF","ALT","GT") := ap2(POS,REF,ALT,GT), by=c("CHROM","POS","REF","ALT","GT")]
Warning messages:
1: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 1 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
2: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 2 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
3: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 3 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
4: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 4 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.

Here is the code to create my tmp data.table:

library(data.table)

tmp <- data.table(
  CHROM = as.character(c("1","1","12","18")) ,
  POS = as.integer(c(58211,6464767,83011,1541042)) ,
  REF = c("A","CAAATAAATAAATAAATAAATAAAT","T","C") ,
  ALT = c("G","C,CAAATAAATAAATAAATAAATAAATAAATAAAT","C","T,A") ,
  GT = c("1/1","1/2","0/1","1/2")
)

And this is the function I am trying to apply:

ap2 <- function(pos,ref,alt,gt) {
  if(gt=="1/2") {
    alt.split <- unlist(strsplit(alt,","))
    matching <- attr(regexpr(ref,alt.split), "match.length")
    if(max(matching) == -1) {
      list(pos,ref,alt,gt)
    } else {
      alt.new <- NULL
      ref.new <- NULL
      pos.new <- NULL
      gt.new <- NULL
      for(i in 1:length(matching)) {
        stopPos <- matching[i]
        if(stopPos == -1) {
          pos.new <- c(pos.new,as.integer(pos))
          ref.new <- c(ref.new,ref)
          alt.new <- c(alt.new,alt.split[i])
        } else {
          pos.new <- c(pos.new, as.integer(pos+matching[i]-1))
          ref.new <- c(ref.new, substring(ref,stopPos))
          alt.new <- c(alt.new, substring(alt.split[i],stopPos))
        }
        gt.new <- c(gt.new, "0/1")
      }
      list(pos.new, ref.new, alt.new, gt.new)
    }
  }
}
Pete
  • I'm pretty sure you can't use `:=` if you're changing the number of rows. Just modify `ap2` so that, in addition to what it currently does, it returns the row unchanged when it doesn't need to be split. You will pretty much get what you're after, with the unfortunate necessity of copying the table. – BrodieG Dec 18 '14 at 19:03
  • @BrodieG According to this thread it is possible: http://stackoverflow.com/questions/15347282/split-string-and-insert-as-new-rows. In the example above try: `tmp[ , list(ALTSPLIT = unlist(strsplit(ALT,","))), by=eval(colnames(tmp))]`. I just can't work out how to do it for multiple columns and replace the original columns. – Pete Dec 19 '14 at 10:59
  • Arun isn't modifying the table by reference in that example. He is making a copy. Notice how there isn't a `:=` in that answer. – BrodieG Dec 19 '14 at 13:43
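
The suggestion in the comments above (return unchanged rows too, then build a new table instead of assigning with `:=`) could be sketched roughly as follows. This is only an illustration under some assumptions: `split_row` is a hypothetical rewrite of `ap2` that always returns a named list, the helper column `row` is introduced here only to group row by row, and the result is a copy rather than an in-place update.

```r
library(data.table)

tmp <- data.table(
  CHROM = c("1", "1", "12", "18"),
  POS   = as.integer(c(58211, 6464767, 83011, 1541042)),
  REF   = c("A", "CAAATAAATAAATAAATAAATAAAT", "T", "C"),
  ALT   = c("G", "C,CAAATAAATAAATAAATAAATAAATAAATAAAT", "C", "T,A"),
  GT    = c("1/1", "1/2", "0/1", "1/2")
)

# Hypothetical rewrite of ap2(): rows that need no split are returned
# unchanged instead of as NULL, so no row is dropped from the result.
split_row <- function(pos, ref, alt, gt) {
  unchanged <- list(POS = pos, REF = ref, ALT = alt, GT = gt)
  if (gt != "1/2") return(unchanged)
  alt.split <- strsplit(alt, ",", fixed = TRUE)[[1]]
  # Match length of REF inside each ALT allele, or -1 where REF does not occur
  matching <- attr(regexpr(ref, alt.split), "match.length")
  if (max(matching) == -1) return(unchanged)
  hit <- matching != -1
  list(
    POS = as.integer(ifelse(hit, pos + matching - 1L, pos)),
    REF = ifelse(hit, substring(ref, matching), ref),
    ALT = ifelse(hit, substring(alt.split, matching), alt.split),
    GT  = rep("0/1", length(alt.split))
  )
}

# Group by a per-row index so each input row may expand to several output rows.
# No := is used, so the row count is free to change and res is a new table.
res <- tmp[, split_row(POS, REF, ALT, GT),
           by = .(CHROM, row = seq_len(nrow(tmp)))]
res[, row := NULL]  # drop the helper index by reference
```

If the original name should keep pointing at the expanded data, `tmp <- res` replaces it afterwards. Note that the split rows get GT "0/1", as in the `ap2` output shown in the question, rather than the "1/2" shown in the expected table.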