2

I am trying to create an R function which selects an entry at random among those entries that have values equal to the maximum and doing this row wise. The trick is that once i select the column for a given row, i no longer want that column to be considered for selection for subsequent rows. I also want to know how many columns had entries that were equal to the rowwise maximum and exactly what that max value was for the row. I have tried many variations on the theme and here's my code as it stands now. The largemat is a large matrix 5000 rows by 20000 columns. I tried to vectorize this but the issue is that it's a dynamic process so the results for row 2 depend on which column was selected for row 1. So i can't just pick row maxes at once because they could change.

Here's an example of the first two rows:

Row 1: .5, .5, 1, 1 Row 2: .6, .8, .7, .9

So i know that the rowmax for Row 1 is 1 and the row max for row 2 is .9. But if i select the fourth column (from the third and fourth from row 1) then I remove that column from possible selection for row 2 (which now has candidates .6, .8, .7)

I am struggling with how to make this more efficient. Any advice would be appreciated. You all are the masters and I am trying to become one. So any advice is so much appreciated!

Here is my current R code:

function(largemat, reordervector, IDvector)
nrowz<-nrow(largemat)
maxvalues<-numeric(nrowz)
numberofmaxes<-integer(nrowz)
idvalue<-integer(nrowz)

#this line randomizes the order of the rows
tempmat<-largemat[reordervector,]
tempsims<-NULL
for (i in 1:nrowz){
tempsims<-which(tempmat[i,]==max(tempmat[i,]))
numberofmaxes[i]<-length(tempsims) 
tempindx<-ifelse(length(tempsims)==1, tempsims, sample(x=tempsims, size=1))
#pick off the largest value
distvalues[i]<-tempmat[i, tempindx]
# record the column id name of the largest value
idvalue[i]<-IDvector[tempindx]
#remove the column so that it cannot be selected again
tempmat<-tempmat[,-tempindx]

list(nm=numberofmaxes, dv=distvalues, ids=idvalue)
 }

The function will generate three vectors each of length nrow(largemat) producing the number of maxes for each row, the id name for the column position in which the max was found for a given row and the value of the maximum from the original matrix.

Here is a small example:

largemat is a matrix:

largemat<-rbind(c(.2 .5  .6 .8 .9  1  1  1),
                c(.3 .4  .8 .9  1 .7  1  1),
                c(.5  1  .6 .6 .9 .9 .8 .1)) 

Assume this matrix has already permuted the rows (so reordervector has already been applied to largemat)

first step: determine which columns have largest value for row 1: (6, 7, 8) second step: randomly select one of these columns (say 7) third step: grab id values corresponding to id name vector for column 7 (and record the maximum value for row 1 was in fact 1) fourth step: shrink the matrix to eliminate column 7 for further consideration and repeat steps on row 2 of the new matrix:

largemat<-rbind(c(.2 .5  .6 .8 .9  1  1),
                c(.3 .4  .8 .9  1  1  1),
                c(.5  1  .6 .6 .9 .8 .1)) 

continue- the resulting vectors of ids will be something like maxes: 1, etc. ids: col7id, etc. (interpreting columns to column ids) numberof maxes would be: 3, etc. (corresponding to the number of columns for a given row that had the max value for that row)

  • you should add sample data and expected outputs: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Bulat Jul 17 '16 at 14:29
  • Thank you for the suggestion. Still relatively new to stack overflow for posting. I just added a small example. Couldn't figure out how to attach data so i just used something small. Of course this will run very quickly, but on my large matrices it takes over 8 hours to work through the 5000 by 20000 largematrix i have – Trent Buskirk Jul 17 '16 at 14:45
  • Is `nonprobmatID` supposed to be the `IDvector` argument? `IDvector` is not used within the function. – bgoldst Jul 17 '16 at 14:48
  • 1
    Your code is incomplete. Where does the function end? – Pierre L Jul 17 '16 at 14:51
  • Thank you. Cut and pasted it from R. Added the list at the end and clarified what is returned from the function. Thanks you! – Trent Buskirk Jul 17 '16 at 15:41

2 Answers2

2

I would create helper functions to complete the task. Your use of ifelse is problematic in the temp creation. Using if is more appropriate. A data.frame output made the most sense to me:

choose.max  <- function(x, omit=NULL) {
  x[omit] <- -Inf
  xmax      <- which(x == max(x))
  x_col  <- if(length(xmax) == 1L) xmax else sample(xmax, size=1L)
  x_value   <- max(x)
  num_maxes <- length(xmax)
  return(data.frame(col=x_col, max_value=x_value, num_maxes=num_maxes))
}

max_choice <- function(df) {
  res <- list(choose.max(df[1,,drop=FALSE]))

  for(i in 2:nrow(df)) {
    res[[i]] <- choose.max(x=df[i,,drop=FALSE], omit=sapply(res, '[[', "col"))
  }

  return(do.call("rbind", res))
}

Calling the function max_choice will create the data frame, The first column is for the maximum column selected, then the maximum value of that row, and the number of maxes:

set.seed(143)
mat <- matrix(sample(1:5, 16, TRUE), 4, 4)
max_choice(mat)
#   col max_value num_maxes
# 1   1         5         2
# 2   2         5         1
# 3   4         5         1
# 4   3         1         1

Edit

If speed is important, you can get a boost with this edit:

max_choice <- function(df) {
  res <- vector("list", nrow(df))
  res[[1]] <- choose.max(df[1,,drop=FALSE])

  for(i in 2:nrow(df)) {
    res[[i]] <- choose.max(x=df[i,,drop=FALSE], omit=sapply(res[!sapply(res,is.null)], '[[', "col"))
  }

  return(do.call("rbind", res))
}

Edit 2

This may even be faster still. parallel is a built-in package for parallel processing:

library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl, c("mat", "choose.max", "max_choice"))
fast_res <- parLapply(cl, 1, function(x) max_choice(mat))[[1]]
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • i will try this and report back. I hope this can speed things up. It's tremendously slow right now the way i have it. Thank you so much for your suggestions and help! – Trent Buskirk Jul 17 '16 at 15:44
  • THanks for sharing this. I am having a hard time locating the parallel package in R. I am using 3.2.5 Is it not available for this version of R? NEVER mind- found it! – Trent Buskirk Jul 17 '16 at 17:41
  • It should be. What happens when you enter `library(parallel)`? – Pierre L Jul 17 '16 at 17:43
  • got it. But is the parallel version a modified function? Not sure how to use this version. Do i need to insert my dataframe called (bigmat) somewhere in the code? – Trent Buskirk Jul 17 '16 at 17:45
  • also getting this error for the parallel version: rors(val) : one node produced an error: could not find function "max_choice" – Trent Buskirk Jul 17 '16 at 17:47
  • OK all is well- i figured it all out. Thank you so very much! The parallel version really works great. This is superb! Have a great weekend!! – Trent Buskirk Jul 17 '16 at 17:50
  • insert `bigmat` where `mat` is – Pierre L Jul 17 '16 at 17:50
  • You're welcome, but you can tell us all what kind of improvements you have gotten? How long does it take now? – Pierre L Jul 17 '16 at 17:51
  • it went from 3000 seconds to about 150 seconds! That was using your parallel version. Thanks a ton! – Trent Buskirk Jul 18 '16 at 18:21
  • Well i have been using the choice_max function for a bit and it has been working but suddenly i am getting this error message which now makes the function inept. Any ideas? Error in x[omit] <- -Inf : invalid subscript type 'list' – Trent Buskirk Jul 21 '16 at 19:52
  • I don't know how you are getting a list in `x`. If the code worked with your data before, please troubleshoot the code using common techniques. Check that all of the values are numeric `sum(!is.numeric(mydataframe))` should be zero. – Pierre L Jul 21 '16 at 20:59
  • Check everything. Look at pieces of the data and see where the error shows up `max_choice(mydata[1:100,])` to check the first 100 rows. Try checking a few columns to see if there is a problem column `max_choice(mydata[, 1:5])`. Try to locate where exactly the error is occurring. There is very little I can do from this distance. – Pierre L Jul 21 '16 at 21:01
  • I made a mistake in reordering the output fro choose.max and not making the same reordering in max_choice. Thank you so much! – Trent Buskirk Jul 22 '16 at 20:11
2

I made some changes in your code using package Rfast and the code became faster. The problem on your code is that inside the for loop there will be nrowz reallocations of tempmat, this line:

tempmat<-tempmat[,-tempindx]

R is very slow in this for some reason. Rfast has a very fast function which you can use to extract the columns of the matrix. There will be also reallocations but it is much faster. Also I changed the ifelse to a normal if-else as mentioned above.

maximum.values<-function(largemat, reordervector, IDvector){
    nrowz<-nrow(largemat)
    maxvalues<-numeric(nrowz)
    numberofmaxes<-integer(nrowz)
    idvalue<-integer(nrowz)
    distvalues<-numeric(nrowz)
    #this line randomizes the order of the rows
    tempmat<-Rfast::rows(largemat,reordervector)
    tempsims<-NULL
    indices=1:nrowz
    for (i in 1:nrowz){
        tempsims<-which(tempmat[i,]==max(tempmat[i,]))
        numberofmaxes[i]<-length(tempsims) 
        tempindx<-if(length(tempsims)==1) tempsims else sample(x=tempsims, size=1)
        #pick off the largest value
        distvalues[i]<-tempmat[i, tempindx]
        # record the column id name of the largest value
        idvalue[i]<-IDvector[tempindx]
        #remove the column so that it cannot be selected again
        indices<-indices[indices!=tempindx]
        tempmat<-Rfast::columns(largemat,indices)
    }
    data.frame(nm=numberofmaxes, dv=distvalues, ids=idvalue)
}

I don't believe in the idea about vectorization but in the idea of optimization (different things sometimes). If you want your code to be faster you must either change your initial code to some other, or even better move to C++.

Manos Papadakis
  • 564
  • 5
  • 17