0

I have this data matrix called mymat. It has got .GT columns for samples 00860 and 00861 . I want to expand this matrix with new .AD column. The corresponding .AD columns for each sample will have values 50,0 if .GT is 0/0, 25/25 if .GT is 0/1 and 0,50 if .GT is 1/1. I also want to add another column called .DP next to each column which will have 50 across the column and get the result. How can I do this kind of conditional expansion of matrix in R?

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193"
), c("00860.GT", "00861.GT")))

result:

           00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1      25/25       50      0/0     50,0     50
chr1:1888193 1/1      0/50        50      0/0     50,0     50
MAPK
  • 5,635
  • 4
  • 37
  • 88

2 Answers2

1

There may be a better way, but one way to do this using dplyr is:

library(dplyr)

set.AD <- function(x) {                                                   ## 1.
  if (x=="0/0") {
    return("50/0")
  } else if (x=="0/1") {
    return("25/25")
  } else {
    return("0/50")
  }
}
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)                        ## 2.
rnames <- rownames(mymat)
out = mymat %>% group_by(ID)                                              ## 3.
            %>% mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50,
                       `X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50)
out <- data.frame(out[,-1])                                               ## 4.
rownames(out) <- rnames

Notes:

  1. Define a function that sets the AD column from the GT column based on your logic.
  2. Convert your data to a data frame, adding a unique identifier column so that we can apply the function to each row using group_by. Also preserve the row names.
  3. Use mutate to create the AD and DP columns for both the X00860.GT and X00861.GT columns. Note that the conversion to data frame prepends the column names with X because R does not like column names that start with a number. See this SO answer for an explanation.

At this point what is returned is a tibble. Therefore,

  1. Remove the ID column, convert to a data frame, and add back the row names.

The result on your data is:

print(out)
##             X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP
##chr1:1163804       0/1       0/0     25/25        50      50/0        50
##chr1:1888193       1/1       0/0      0/50        50      50/0        50

To reorder the columns to match your output, you can simply:

out <- out[,c(1,3,4,2,5,6)]
##             X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP
##chr1:1163804       0/1     25/25        50       0/0      50/0        50
##chr1:1888193       1/1      0/50        50       0/0      50/0        50

Obviously, this approach can only handle your two columns, but can handle any number of rows.


Edited to handle any number of columns (samples)

Notes are given as comments

# keep column and row names of original mymat to use later
cnames <- colnames(mymat)
rnames <- rownames(mymat)
# since DP columns are always 50, we just create a data frame filled with 50
# to bind to the result as additional columns
dp <- data.frame(matrix(rep(50,ncol(mymat)*nrow(mymat)), nrow=nrow(mymat), ncol=ncol(mymat)))
# set the column name to that of mymat
colnames(dp) <- cnames
# convert to data frame and augment with ID as before
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)
# the difference here is that we use mutate_each to apply set.AD to each
# (and all) column of the input. This is done in-place. We then bind the 
# original mymat and dp as columns to this result
out <- mymat %>% group_by(ID) 
             %>% mutate_each(funs(set.AD)) 
             %>% ungroup() %>% select(-ID) 
             %>% bind_cols(mymat[,-1],.) %>% bind_cols(dp)
# At this point, we have the original mymat columns followed by the 
# AD columns followed by the DP columns. The following uses a matrix 
# transpose trick to resort the columns to what you want
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=ncol(mymat)-1, ncol=3)))
out <- data.frame(out[,col.order])
# finally, use gsub to change the column names for the AD and DP columns,
# get rid of the 'X' in the column names, and add back the row names
colnames(out) <- gsub("X", "", gsub("GT.1", "AD", gsub("GT.2", "DP", colnames(out))))
rownames(out) <- rnames
print(out)
##             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
##chr1:1163804      0/1    25/25       50      0/0     50/0       50
##chr1:1888193      1/1     0/50       50      0/0     50/0       50

Hope this helps.

Community
  • 1
  • 1
aichao
  • 7,375
  • 3
  • 16
  • 18
  • Thanks. This is useful,but what if there are more than 2000 samples? – MAPK Aug 03 '16 at 02:26
  • 1
    You mean columns. No, this will not work for that. If you need that, then I can see if the answer can be edited. – aichao Aug 03 '16 at 02:28
1

Here's a data.table solution, with each line commented. It is written to handle any number of columns in your mymat object. I will explain briefly:

1) First, we convert to a data.table format where we can handle any number of columns, assuming it will be in a similar format.

2) We find all of the ".GT" columns and extract the number before the ".GT".

3) We create ".DP" columns for each ".GT" column found.

4) We develop a "GT" to "AD" mapping by creating a vector of the "to" part of the mapping. The "from" part is stored as names in the vector.

5) Use the .SDcols feature in the data.table to apply the "GT" to "AD" mapping, and create the "AD" columns.

# Your matrix
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), 
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"), 
                    c("00860.GT", "00861.GT")))

# Using a data table approach
library(data.table)

# Casting to data table - row.names will be converted to a column called 'rn'.
mymat = as.data.table(mymat, keep.rownames = T)

# Find "GT" columns
GTcols = grep("GT", colnames(mymat))

# Get number before ".GT"
selectedCols = gsub(".GT", "", colnames(mymat)[GTcols])

selectedCols
[1] "00860" "00861"

# Create ".DP" columns
mymat[, paste0(selectedCols, ".DP") := 50, with = F]

mymat
             rn 00860.GT 00861.GT 00860.DP 00861.DP
1: chr1:1163804      0/1      0/0       50       50
2: chr1:1888193      1/1      0/0       50       50

# Create "GT" to "AD" mapping
GTToADMapping = c("50,0", "25/25", "0/50")
names(GTToADMapping) = c("0/0", "0/1", "1/1")

GTToADMapping
0/0     0/1     1/1 
"50,0" "25/25"  "0/50" 

# This function will return the "AD" mapping given the values of "GT"
mapGTToAD <- function(x){
  return (GTToADMapping[x])
}

# Here, we create the AD columns using the GT mapping
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD), with = F,
        .SDcols = colnames(mymat)[GTcols]]

             rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD
1: chr1:1163804      0/1      0/0       50       50    25/25     50,0
2: chr1:1888193      1/1      0/0       50       50     0/50     50,0

# We can sort the data now as you have it
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"), 
                     paste0(selectedCols, ".AD"), 
                     paste0(selectedCols, ".DP")))
mymat = mymat[, c("rn", colOrder), with = F]

mymat
             rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
1: chr1:1163804      0/1    25/25       50      0/0     50,0       50
2: chr1:1888193      1/1     0/50       50      0/0     50,0       50

# Put it back in the format you had
mymat2 = as.matrix(mymat[,-1, with = F])
rownames(mymat2) = mymat$rn

mymat2
             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 "0/1"    "25/25"  "50"     "0/0"    "50,0"   "50"    
chr1:1888193 "1/1"    "0/50"   "50"     "0/0"    "50,0"   "50"    
jav
  • 1,485
  • 8
  • 11
  • Thanks. This is awesome. – MAPK Aug 03 '16 at 02:33
  • What is this warning :`> mymat[, paste0(selectedCols, ".DP") := 50, with = F] Warning message: In `[.data.table`(mymat, , `:=`(paste0(selectedCols, ".DP"), 50), : truelength (3141) is greater than 1000 items over-allocated (length = 1055). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().` – MAPK Aug 03 '16 at 02:39
  • This warning is new to me. Please try the solution given here (using the `alloc.col` function): http://stackoverflow.com/questions/29615181/r-warning-when-creating-a-long-list-of-dummies and let us know if the warning still occurs. – jav Aug 03 '16 at 02:46
  • Still getting the warning: `Warning messages: 1: In `[.data.table`(mymat, , `:=`(paste0(selectedCols, ".DP"), 50), : truelength (3141) is greater than 1000 items over-allocated (length = 1055). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo(). 2: In alloc.col(mymat[, `:=`(paste0(selectedCols, ".DP"), 50), with = F]) : tl (3141) is greater than 1000 items over-allocated (l = 2098). If you didn't set the datatable.alloccol option to be very large, please report thi.` – MAPK Aug 03 '16 at 02:55
  • If I figure out how to `alloc.col` function, I will let you know. In the meantime, I believe that the code will produce what you want. As per the stackoverflow site linked above, this warning appears to me to suggest setting `alloc.col` in advance so that the data.table performance is better. If you do not see any problems in performance of the code above, then this code should be able to be used as is. – jav Aug 03 '16 at 03:09
  • 1
    @MAPK: in case you still care, I've updated my post. It now should handle any number of columns and rows. – aichao Aug 03 '16 at 04:01