I am creating a new data frame from a large three-dimensional array using a nested R loop. When I run the code, the job dies after ~48 hours. The current nested loop is shown below. I would really like to vectorize the loop to make it more efficient, but I am unsure how, or whether that is even possible over a multi-dimensional array. Any suggestions for improving the efficiency of the job are very much appreciated. For reference, my_array is a small piece of my array with two slices. The data in the array are probability values, and the loop finds the founder with the maximum probability at a specific mouse and marker. The final output is a data frame with mice names as rows, markers as columns, and the founder as the data. Example code is below.

    founder_names <- rownames(model.probs[1,,])
    mice_names <- rownames(model.probs[,1,])
    marker_names <- colnames(model.probs[1,,])

    # Create empty data frame
    probs.df <- data.frame()

    ## Instructions for nested loop

    for(marker in marker_names) {
      for(mouse in mice_names){
        probs.df[mouse, marker] = names(which.max(my_array[mouse,,marker]))
      }
    }

Example Data from dput(my_array):

structure(c(1.86334813592728e-08, 2.02070595143633e-10, 2.1558577630356e-08, 
2.1558577630356e-08, 2.04388477395613e-10, 2.04388477395593e-10, 
2.04388477395613e-10, 2.031707697502e-10, 2.04388477395593e-10, 
2.0317076975018e-10, 0.999999939150967, 1.19701878645413e-10, 
2.94522644878888e-08, 2.94522644878888e-08, 1.20988752710968e-10, 
1.20988752710968e-10, 1.20988752710968e-10, 1.20313358746148e-10, 
1.20988752710968e-10, 1.20313358746148e-10, 2.41632503275453e-12, 
2.53195197455819e-08, 2.89630046322804e-12, 2.89630046322804e-12, 
2.46380958026699e-08, 2.46380958026699e-08, 2.46380958026724e-08, 
2.44127737551662e-08, 2.46380958026699e-08, 2.44127737551638e-08, 
1.08633475857376e-12, 0.999999925628544, 1.30167423493078e-12, 
1.30167423493078e-12, 2.49445205965502e-08, 2.49445205965502e-08, 
2.49445205965527e-08, 2.47171256696929e-08, 2.49445205965502e-08, 
2.47171256696904e-08, 1.84322523200704e-08, 6.29795050516582e-11, 
2.13175870442828e-08, 2.13175870442849e-08, 6.40871335417646e-11, 
6.40871335417646e-11, 6.40871335417646e-11, 6.35035199711943e-11, 
6.40871335417646e-11, 6.3503519971188e-11, 0.999999939821495, 
2.75475678555388e-11, 2.91247770927105e-08, 2.91247770927134e-08, 
2.80325925630150e-11, 2.80325925630123e-11, 2.80325925630150e-11, 
2.77773153893157e-11, 2.80325925630123e-11, 2.77773153893129e-11, 
6.56947829427486e-13, 2.50477863870057e-08, 7.89281798086196e-13, 
7.89281798086277e-13, 2.43639980473783e-08, 2.43639980473783e-08, 
2.43639980473783e-08, 2.41399147887054e-08, 2.43639980473783e-08, 
2.4139914788703e-08, 1.7742262257411e-13, 0.999999926913761, 
2.13166988220277e-13, 2.13166988220277e-13, 2.46686866862984e-08, 
2.46686866862984e-08, 2.46686866863009e-08, 2.44425383948499e-08, 
2.46686866862984e-08, 2.44425383948499e-08), .Dim = c(10L, 4L, 
2L), .Dimnames = list(c("B6HER2", "X100", "X1002", "X1005", "X1006", 
    "X1007", "X1010", "X1011", "X1012", "X1014"), c("AI", "BI", "CI", 
    "DI"), c("UNC6", "JAX00000010")))
hacketju
  • Please share part of your data using `dput()` so others can help. See more here [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung Aug 08 '18 at 18:42
  • Also explain more what you are trying to do and what would be the desired output – Tung Aug 08 '18 at 18:43
  • Am I correct that `mice_names = c("AI", "BI", "CI", "DI")` and `marker_names = c("UNC6", "JAX00000010")`? If so, you should state that more clearly in the question. – divibisan Aug 08 '18 at 20:50

1 Answer

the loop finds the founder with max probability value at a specific mouse&marker.

I'd maybe do...

    # assign the dim names directly to the array:
    names(dimnames(my_array)) <- c("founder", "mouse", "marker")

    # enumerate combos with expand.grid(), not data.frame()
    resdf = expand.grid(mouse = dimnames(my_array)$mouse, marker = dimnames(my_array)$marker)

    # take maxes within slices
    resdf$founder_max = dimnames(my_array)$founder[
      c(apply(my_array, c("mouse", "marker"), which.max))
    ]

      mouse      marker founder_max
    1    AI        UNC6       X1002
    2    BI        UNC6      B6HER2
    3    CI        UNC6        X100
    4    DI        UNC6        X100
    5    AI JAX00000010       X1005
    6    BI JAX00000010      B6HER2
    7    CI JAX00000010        X100
    8    DI JAX00000010        X100

Alternately, with reshape2:

    library(reshape2)

    resdf2 = melt(apply(my_array, c("mouse", "marker"), function(x) 
      dimnames(my_array)$founder[which.max(x)]
    ))

      mouse      marker  value
    1    AI        UNC6  X1002
    2    BI        UNC6 B6HER2
    3    CI        UNC6   X100
    4    DI        UNC6   X100
    5    AI JAX00000010  X1005
    6    BI JAX00000010 B6HER2
    7    CI JAX00000010   X100
    8    DI JAX00000010   X100

If you're still running into speed issues, there are alternatives to apply in, e.g., the matrixStats package, or you might write your own custom fast code with Rcpp. There might also be some way to reshape your problem so it can use the fast max.col function in base R ... though I don't immediately see it.
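One possible way to bring max.col into play, offered as a sketch rather than a tested drop-in: flatten the array so that each (mouse, marker) combination becomes a row and each founder a column, then let max.col pick the column of the row maximum. The toy array, its values, and the founder names F1 to F3 below are made up for illustration; only the dimension layout (founder, mouse, marker) follows the code above. Note that max.col treats entries within a small relative tolerance of the row maximum as ties, so with near-equal probabilities it may not always agree with which.max.

```r
# Toy array with the layout assumed above:
# dim 1 = founder, dim 2 = mouse, dim 3 = marker (values are made up).
toy <- array(
  c(0.10, 0.70, 0.20,  0.60, 0.30, 0.10,  0.20, 0.20, 0.60,  0.80, 0.10, 0.10,
    0.30, 0.30, 0.40,  0.05, 0.90, 0.05,  0.50, 0.40, 0.10,  0.10, 0.20, 0.70),
  dim = c(3, 4, 2),
  dimnames = list(
    founder = c("F1", "F2", "F3"),
    mouse   = c("AI", "BI", "CI", "DI"),
    marker  = c("UNC6", "JAX00000010")
  )
)

# Flatten to a founder x (mouse * marker) matrix; mouse varies fastest
# across the columns because it is the second dimension.
flat <- matrix(toy, nrow = dim(toy)[1])

# max.col() works row-wise, so transpose to (mouse * marker) x founder.
# ties.method = "first" mirrors which.max() when several entries tie.
idx <- max.col(t(flat), ties.method = "first")

# Map indices back to founder names and restore the mouse x marker shape.
res_maxcol <- matrix(
  dimnames(toy)$founder[idx],
  nrow = dim(toy)[2],
  dimnames = dimnames(toy)[c("mouse", "marker")]
)

# Reference result from the apply() approach, for comparison.
res_apply <- apply(toy, c("mouse", "marker"), function(x)
  dimnames(toy)$founder[which.max(x)]
)
```

On this toy input the two results agree cell by cell; whether the reshape is actually faster on the full array would need to be timed.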


The final output is a dataframe with mice names as rows, markers as columns, and the founder as the data.

If you really want that format, you can stop after the apply:

    apply(my_array, c("mouse", "marker"), function(x) 
      dimnames(my_array)$founder[which.max(x)]
    )

         marker
    mouse UNC6     JAX00000010
       AI "X1002"  "X1005"    
       BI "B6HER2" "B6HER2"   
       CI "X100"   "X100"     
       DI "X100"   "X100"  

This is a matrix, not a data.frame. I don't think it should be converted to a data.frame (except as melt does), but if you somehow need one, you can wrap it in as.data.frame.
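If the data.frame form is needed after all, the wrapping step might look like the sketch below. The matrix is rebuilt by hand from the output shown above so that the snippet runs on its own; the names founder_mat and founder_df are placeholders, not part of the original code.

```r
# Hand-built copy of the mouse x marker character matrix shown above.
founder_mat <- matrix(
  c("X1002", "B6HER2", "X100", "X100",
    "X1005", "B6HER2", "X100", "X100"),
  nrow = 4,
  dimnames = list(
    mouse  = c("AI", "BI", "CI", "DI"),
    marker = c("UNC6", "JAX00000010")
  )
)

# Rows stay mice, columns stay markers; keep founders as character, not factor.
founder_df <- as.data.frame(founder_mat, stringsAsFactors = FALSE)
```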

Frank
  • Thanks so much @Frank! That reduced processing time to a couple minutes and I have a good looking data set. – hacketju Aug 09 '18 at 14:00