9

I have a data table with nrow being around a million or two and ncol of about 200.

Each entry in a row has a coordinate associated with it.

Tiny portion of the data:

[1,] -2.80331471  -0.8874522 -2.34401863   -3.811584   -2.1292443
[2,]  0.03177716   0.2588624  0.82877467    1.955099    0.6321881
[3,] -1.32954665  -0.5433407 -2.19211837   -2.342554   -2.2142461
[4,] -0.60771429  -0.9758734  0.01558774    1.651459   -0.8137684

Coordinates for the first 4 rows:

9928202 9928251 9928288 9928319

What I would like is a function that given the data and window-size would return a data table of the same size with a mean sliding window applied on each column. Or in other words - for each row entry i it would find entries with coordinates between coords[i]-windsize and coords[i]+windsize and replace the initial value with the mean of the values inside that interval (separately for each column).

Speed is the main issue here.

Here is my first take of such function.

doSlidingWindow <- function(intensities, coords, windsize) {
windHalfSize <- ceiling(windsize/2)
### whole range inds
RANGE <- integer(max(coords)+windsize)
RANGE[coords] <- c(1:length(coords)[1])

### get indeces of rows falling in each window
COORDS <- as.list(coords)
WINDOWINDS <- sapply(COORDS, function(crds){ unique(RANGE[(crds-windHalfSize):
    (crds+windHalfSize)]) })

### do windowing

wind_ints <- intensities
wind_ints[] <- 0
for(i in 1:length(coords)) {
    wind_ints[i,] <- apply(as.matrix(intensities[WINDOWINDS[[i]],]), 2, mean)
}
return(wind_ints)
}

The code before the last for loop is quite fast and it gets me a list of the indexes I need to use for each entry. However then everything falls apart since I need to grind the for loop a million times, take subsets of my data table and also make sure that I have more than one row to be able to work with all the columns at once inside apply.

My second approach is to just stick the actual values in the RANGE list, fill the gaps with zeroes and do rollmean from zoo package, repeated for each column. But this is redundant since rollmean will go through all the gaps and I will only be using the values for original coordinates in the end.

Any help to make it faster without going to C would be very appreciated.

Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89
  • I'm not an expert w/ `zoo` , but are you sure using `rollmean(data,fill=NA)` won't be fast enough? – Carl Witthoft Jan 07 '13 at 12:15
  • If you anyway store the data in a database: sqldf in a database with PostgreSQL can do running window stats. – Dieter Menne Jan 07 '13 at 13:51
  • to Carl: rollmean is sure fast enough. But it cannot handle intervals on arbitrary coordinates. It just uses a fixed window size on the time series and the time series has regular intervals. In this case the intervals are not regular and the spaces between two points can be arbitrary. So If I fill all the gaps with zeroes for zoo package - I would get a vector of length around 500 million. To do it with rollmean on a dataframe is pain, especially when I only need a few million out of those 500 computed with rollmean. – Karolis Koncevičius Jan 07 '13 at 17:46
  • In last loop it's better to change line to: `wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)`. When it's only one row in a window your code causes wrong results. – redmode Jan 20 '13 at 02:11

2 Answers2

7

Data generation:

N <- 1e5 # rows
M <- 200 # columns
W <- 10  # window size

set.seed(1)
intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
coords <- 8000000 + sort(sample(1:(5*N), N))

Original function with minor modifications I used for benchmarks:

doSlidingWindow <- function(intensities, coords, windsize) {
  windHalfSize <- ceiling(windsize/2)
  ### whole range inds
  RANGE <- integer(max(coords)+windsize)
  RANGE[coords] <- c(1:length(coords)[1])

  ### get indices of rows falling in each window
  ### NOTE: Each elements of WINDOWINDS holds zero. Not a big problem though.
  WINDOWINDS <- sapply(coords, function(crds) ret <- unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))

  ### do windowing
  wind_ints <- intensities
  wind_ints[] <- 0
  for(i in 1:length(coords)) {
    # CORRECTION: When it's only one row in window there was a trouble
    wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
  }
  return(wind_ints)
}

POSSIBLE SOLUTIONS:


1) data.table

data.table is known to be fast with subsetting, but this page (and other related to sliding window) suggests, that this is not the case. Indeed, data.table code is elegant, but unfortunately very slow:

require(data.table)
require(plyr)
dt <- data.table(coords, intensities)
setkey(dt, coords)
aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])

2) foreach+doSNOW

Basic routine is easy to run in parallel, so, we can benefit from it:

require(doSNOW)
doSlidingWindow2 <- function(intensities, coords, windsize) {
  NC <- 2 # number of nodes in cluster
  cl <- makeCluster(rep("localhost", NC), type="SOCK")
  registerDoSNOW(cl)

  N <- ncol(intensities) # total number of columns
  chunk <- ceiling(N/NC) # number of columns send to the single node

  result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
    start <- (i-1)*chunk+1
    end   <- ifelse(i!=NC, i*chunk, N)
    doSlidingWindow(intensities[,start:end], coords, windsize)    
  }

  stopCluster(cl)
  return (result)
}

Benchmark shows notable speed-up on my Dual-Core processor:

system.time(res <- doSlidingWindow(intensities, coords, W))
#    user  system elapsed 
# 306.259   0.204 307.770
system.time(res2 <- doSlidingWindow2(intensities, coords, W))
#  user  system elapsed 
# 1.377   1.364 177.223
all.equal(res, res2, check.attributes=FALSE)
# [1] TRUE

3) Rcpp

Yes, I know you asked "without going to C". But, please, take a look. This code is inline and rather straightforward:

require(Rcpp)
require(inline)
doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
  #include <vector>
  Rcpp::NumericMatrix intensities(intens);
  const int N = intensities.nrow();
  const int M = intensities.ncol();
  Rcpp::NumericMatrix wind_ints(N, M);

  std::vector<int> coords = as< std::vector<int> >(crds);
  int windsize = ceil(as<double>(wsize)/2);  

  for(int i=0; i<N; i++){
    // Simple search for window range (begin:end in coords)
    // Assumed that coords are non-decreasing
    int begin = (i-windsize)<0?0:(i-windsize);
    while(coords[begin]<(coords[i]-windsize)) ++begin;
    int end = (i+windsize)>(N-1)?(N-1):(i+windsize);
    while(coords[end]>(coords[i]+windsize)) --end;

    for(int j=0; j<M; j++){
      double result = 0.0;
      for(int k=begin; k<=end; k++){
        result += intensities(k,j);
      }
      wind_ints(i,j) = result/(end-begin+1);
    }
  }

  return wind_ints;
')

Benchmark:

system.time(res <- doSlidingWindow(intensities, coords, W))
#    user  system elapsed 
# 306.259   0.204 307.770
system.time(res3 <- doSlidingWindow3(intensities, coords, W))
#  user  system elapsed 
# 0.328   0.020   0.351
all.equal(res, res3, check.attributes=FALSE)
# [1] TRUE

I hope results are quite motivating. While data fits in memory Rcpp version is pretty fast. Say, with N <- 1e6 and M <-100 I got:

   user  system elapsed 
  2.873   0.076   2.951

Naturally, after R starts using swap everything slows down. With really large data that doesn't fit in memory you should consider sqldf, ff or bigmemory.

Community
  • 1
  • 1
redmode
  • 4,821
  • 1
  • 25
  • 30
  • Did you intend for section 1 to state that `data.table` is not fast at subsetting, and state that although `data.table` is elegant it's not actually fast? That benchmark appears to use `plyr` as well and times the combination. It seems to be passing vectors of row numbers to `data.table` to take many copies separately. – Matt Dowle Jan 20 '13 at 13:17
  • This is a more accurate link: [do rolling mean in `j` not repeated `i` subsets](http://stackoverflow.com/a/12157723/403310). – Matt Dowle Jan 20 '13 at 13:25
  • @Matthew Dowle, I know `data.table` to be pretty fast at subsetting, that is why I gave it a try. But it seems to be not the right tool for rolling window (or at least, I didn't cope to use `data.table` correctly to speed-up computations). – redmode Jan 20 '13 at 19:36
  • @Matthew Dowle, BTW, do you think it's better to remove section 1 from the answer? – redmode Jan 20 '13 at 19:36
  • 1
    It's ok, these comments cover it. It's good to have bad use of data.table online as well. – Matt Dowle Jan 20 '13 at 22:00
  • @redmode How much tweaking is required here to handle `coords` that start close to 0? That is, the code throws errors if I do `coords <- sort(sample(1:(5*N), N))`. – Jota Jul 15 '14 at 01:13
  • @Frank, it's strange. I've just run `coords <- sort(sample(1:(5*N), N))` followed by `doSlidingWindow3(intensities, coords, W)` - it's working as is. What version of the function have you used? – redmode Jul 15 '14 at 13:10
  • @redmode, I used `doSlidingWindow`, originally. Now, I tried `doSlidingWindow3`, and it works, though it's worth noting that I needed to input 1/2 my desired window size instead of the window size itself and I got a cygwin warning message after defining the function. – Jota Jul 15 '14 at 14:15
  • @Frank, what kind of message? I can't reproduce it though. – redmode Jul 15 '14 at 22:56
  • `cygwin warning: MS-DOS style path detected: C:/PROGRA~1/R/R-31~1.0/etc/x64/Makeconf Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-31~1.0/etc/x64/Makeconf CYGWIN environment variable option "nodosfilewarning" turns off this warning. Consult the user's guide for more details about POSIX paths: http://cygwin.com/cygwin-ug-net/using.html#using-pathnames1` – Jota Jul 16 '14 at 14:10
  • @Frank, it's just CYGWIN warning, you can learn how to turn it off here: https://cygwin.com/cygwin-ug-net/setup-env.html. This answer can also be helpful: http://stackoverflow.com/questions/9764495/how-to-get-rcpp-to-work-in-r-on-a-windows-xp-platform – redmode Jul 16 '14 at 17:18
1

Rollapply works great with a small dataset. However, if you are working with several million rows (genomics) it is quite slow.

The following function is super fast:

data <- c(runif(100000, min=0, max=.1),runif(100000, min=.05, max=.1),runif(10000, min=.05, max=1), runif(100000, min=0, max=.2))
slideFunct <- function(data, window, step){
  total <- length(data)
  spots <- seq(from=1, to=(total-window), by=step)
  result <- vector(length = length(spots))
  for(i in 1:length(spots)){
    result[i] <- mean(data[spots[i]:(spots[i]+window)])
  }
  return(result)
}

Details here.

Alex M
  • 2,756
  • 7
  • 29
  • 35