-2

I am trying to optimize some code that I have written as it is very slow for large datasets. I am not sure if the following can be done with matrix operations and I would appreciate if someone had any suggestions to make it faster.

I have a matrix with zeros and integers and I would like to shift down the entries of the individual columns by the absolute number of the integer in the the entry.

   [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0   -4    0
[3,]    4    0    0
[4,]   -3   -2    0
[5,]    0    2   -1
[6,]    2   -2    0
[7,]    0    0    0
[8,]   -3   -3    0  

The code I am using is the following:

#data
A<-matrix(data=c(0,0,4,-3,0,2,0,-3,0,-4,0,-2,2,-2,0,-3,0,0,0,0,-1,0,0,0),nrow=8,ncol=3)

#shift function
shift<-function(x)
{
  #create the output matrix
  out<-matrix(data=0,nrow=8,ncol=1)

  #for loop to create the shift matrix
  for(i in seq(1,8,by=1))
  {
    if(i+abs(x[i])<=8)
    {
      #find the non zero
      if(x[i]!=0)
      {
        #if there is already a number put zero  
        if(out[i+abs(x[i]),1]!=0)
        {
          out[i+abs(x[i]),1]=0
        } else {
          #shift
          out[i+abs(x[i]),1]=x[i]
        }
      }
    }
  }

  #return object
  return(out)
}

#run the logic
shift_mat<-sapply(1:ncol(A),FUN=function(k) shift(A[,k]))

and the result is:

   [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
[4,]    0    0    0
[5,]    0    0    0
[6,]    0    0   -1
[7,]    0    2    0
[8,]    2   -2    0

The rules are the following for every column:

  1. starting from the top find first entry that is different than zero
  2. shift down by the absolute numbers of that entry
  3. if there is another entry at the targeted point put zero
  4. repeat for the next column

Thanks,

Nikos

user2493820
  • 61
  • 2
  • 8

2 Answers2

2

This is a bit cleaner and about 40% faster using your example on my machine. Maybe the speed improvement will be greater using your larger data?

You should use a matrix of integers. It uses less memory and some operations are faster:

A <- matrix(as.integer(c(0,0,4,-3,0,2,0,-3,0,-4,0,-2,2,
                        -2,0,-3,0,0,0,0,-1,0,0,0)), nrow = 8, ncol = 3)

Each column is a vector, so should be your output. I replaced matrices with vectors. Also made your code more robust without the hardcoded number of rows:

shift <- function(x) {
  n <- length(x)
  y <- rep(0L, n)
  for(i in seq_len(n)) {
    if (x[i] == 0L) next
    j <- i + abs(x[i])
    if (j > n) next
    y[j] <- if (y[j] != 0L) 0L else x[i]
  }
  return(y)
}

You can run it using apply:

shift_mat <- apply(A, 2, shift)
flodel
  • 87,577
  • 21
  • 185
  • 223
  • @Peyton. Thx. I'm not sure there is a choice though, as the order of the operations is important here. This is the kind of algorithm that would greatly benefit from a Rcpp rewrite. – flodel Aug 16 '13 at 21:33
  • @flodel thanks for the suggestions. I guess I would have to look at the Rcpp package. this function is part of a bigger program that has a 4000x200 matrix and takes a lot of time to compute – user2493820 Aug 16 '13 at 21:39
  • @user2493820: I just tested with `A <- matrix(sample(-2:2, 4000*200, replace = TRUE), 4000, 200)` and it took 5 seconds. Is this too long for your need? – flodel Aug 16 '13 at 21:40
  • the program runs a portfolio and not sure how easy would be to isolate this function... I would try to check against the total time. but the Rcpp suggestion is good. the function has a reduced form from the one I am actually using – user2493820 Aug 16 '13 at 21:44
  • @flodel there was a huge improvement on the time of the calculation. Mainly due to the change you made from sapply to apply. Thanks again – user2493820 Aug 17 '13 at 00:19
2

The shift operation can be vectorized. Let's just take the first column of your data to see how:

v = c(0,0,4,-3,0,2,0,-3)

# index of the elements that could be non-zero in the final result
index = ifelse (v != 0 & abs(v) + seq_along(v) <= length(v),
                abs(v) + seq_along(v), 0)
# [1] 0 0 7 7 0 8 0 0


# now just need to filter out the duplicated entries
index = ave(index, index, FUN = function(x) {if (length(x) > 1) 0 else x})
# [1] 0 0 0 0 0 8 0 0

# home at last
res = integer(length(v))
res[index] = v[which(index != 0)]
res
# [1] 0 0 0 0 0 0 0 2

You can then put then above into a function and then lapply over your data.frame or apply on the columns of your matrix.

Unsurprisingly the biggest bottleneck above is the ave function, and you can replace that line with the following data.table construct (don't forget to require(data.table) somewhere) to speed it up considerably:

index = data.table(index)[, index := if(.N > 1) 0 else index, by = index][, index]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Consider `x <- c(3, 2, 1, 0)` and maybe you'll agree that `FUN` should rather be `function(x) {if (length(x) %% 2L) tail(x, 1) else 0L}`. Otherwise, great answer. Took me a while to agree it can be vectorized. – flodel Aug 17 '13 at 00:19
  • @flodel sorry I don't get it - can you point out what the problem with that `x` is? (also is that `x` supposed to be `index` or `v` in the above) I don't get why I'd want to compute length mod 2 here? – eddi Aug 17 '13 at 06:23
  • @flodel oh I think I get what you mean - I think this comes down to how I understand what OP wrote, but I do see your point - my understanding is - if there is any conflict at that point, put a zero, and yours seems to be - if there is sequential conflict only, put a zero – eddi Aug 17 '13 at 06:26