1

Is there a more effective way to match matrix rows when using large matrices? I have a vector with values that correspond to a matrix of 2^N rows. N is typically large e.g., >20. Each row is a unique combination of N={0,1} values and represents a 'position' on a decision space. I.e., for N=3 the rows would be 0 0 0, 0 0 1, 0 1 0, 1 0 0, ..., 1 1 1

I need to determine whether a position is a local maximum, i.e., whether the N neighboring positions are of lower values. For example, to the position 0 0 0, the neighboring positions are 1 0 0, 0 1 0, and 0 0 1, accordingly. I have coded the following solution that does the job but very slowly for large N.

library(prodlim) #for row.match command

set.seed(1234)
N=10

space = as.matrix(expand.grid(rep(list(0:1), N))) #creates all combinations of 0-1 along N-dimensions

performance = replicate(2^N, runif(1, min=0, max=1)) #corresponding values for each space-row (position)

#determine whether a space position is a local maxima, that is, the N neighboring positions are smaller in performance value


system.time({
local_peaks_pos = matrix(NA,nrow=2^N, ncol=1)
for(v in 1:2^N)
{

  for(q in 1:N)
  {
    temp_local_pos = space[v,1:N]
    temp_local_pos[q] = abs(temp_local_pos[q]-1)

    if(performance[row.match(temp_local_pos[1:N], space[,1:N])] > performance[v])
    {
      local_peaks_pos[v,1] = 0
      break
    }

  }

}
local_peaks_pos[is.na(local_peaks_pos)] = 1
})

  user  system elapsed 
   9.94    0.05   10.06 
Alexis
  • 4,950
  • 1
  • 18
  • 37
Dan
  • 59
  • 8
  • Do you _need_ to have the `space` matrix? Each row can be indexed by a single integer with its binary representation corresponding to the position in your decision space. – Gabe Jun 22 '18 at 19:59
  • As long as each position can be addressed in reference to the actual configuration of the binary string of N positions, that would work. But I am not sure if I fully understand yet how to rewrite the code then? Thanks for any suggestions! – Dan Jun 22 '18 at 20:28

2 Answers2

2

As Gabe mentioned in his comment, you can exploit the fact that your decision space can be interpreted as single integers:

set.seed(1234L)
N <- 10L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2L ^ (0L:(N - 1L))))

is_local_max <- sapply(0L:(2^N - 1), function(i) {
  multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
  multipliers[multipliers == 0L] <- 1L
  neighbors <- i + powers_of_two * multipliers
  # compensate that R vectors are 1-indexed
  !any(performance[neighbors + 1L] > performance[i + 1L])
})

# compensate again
local_peaks_int <- which(is_local_max) - 1L
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
  as.integer(rev(intToBits(int)[1L:N]))
}))

> head(local_peaks_binary)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    1    0     0
[2,]    0    0    0    0    1    0    0    1    1     0
[3,]    0    0    0    0    1    1    1    1    0     0
[4,]    0    0    0    1    0    0    0    1    1     1
[5,]    0    0    0    1    0    1    0    1    0     1
[6,]    0    0    0    1    1    0    1    1    1     0

In decimal, multipliers contains the the sign of the powers_of_two so that, when added to the current integer, it represents a bit flip in binary. For example, if the original binary was 0 0 and we flip one bit to get 1 0, it's as if we added 2^1 in decimal, but if it was originally 1 0 and we flip one bit to get 0 0, then we subtracted 2^1 in decimal.

Each row in local_peaks_binary is a binary from your decision space, where the least significant bit is on the right. So, for example, the first local peak is a decimal 4.

See this question for the mapping of integers to binary.

EDIT: and if you want to do it in parallel:

library(doParallel)
set.seed(1234L)
N <- 20L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2 ^ (0:(N - 1))))

num_cores <- detectCores()
workers <- makeCluster(num_cores)
registerDoParallel(workers)

chunks <- splitIndices(length(performance), num_cores)
chunks <- lapply(chunks, "-", 1L)
local_peaks_int <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE) %dopar% {
  is_local_max <- sapply(chunk, function(i) {
    multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
    multipliers[multipliers == 0L] <- 1L
    neighbors <- i + powers_of_two * multipliers
    # compensate that R vectors are 1-indexed
    !any(performance[neighbors + 1L] > performance[i + 1L])
  })

  # return
  chunk[is_local_max]
}

local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
  as.integer(rev(intToBits(int)[1L:N]))
}))

stopCluster(workers); registerDoSEQ()

The above completes in ~2.5 seconds in my system, which has 4 CPU cores.


Here is a C++ version that uses multi-threading but, at least in my system with 4 threads, it doesn't seem faster than Gabe's Fortran version. However, when I try to run Gabe's Fortran code in a new session, I get the following error with N <- 29L: cannot allocate vector of size 4.0 Gb.

EDIT: Apparently I changed something important along the way, because after testing again, the C++ version actually seems faster.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <cstddef> // size_t
#include <vector>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

class PeakFinder : public Worker
{
public:
  PeakFinder(const NumericVector& performance, vector<int>& peaks, const int N)
    : performance_(performance)
    , peaks_(peaks)
    , N_(N)
  { }

  void operator()(size_t begin, size_t end) {
    vector<int> peaks;
    for (size_t i = begin; i < end; i++) {
      bool is_local_peak = true;
      unsigned int mask = 1;
      for (int exponent = 0; exponent < N_; exponent++) {
        unsigned int neighbor = static_cast<unsigned int>(i) ^ mask; // bitwise XOR
        if (performance_[i] < performance_[neighbor]) {
          is_local_peak = false;
          break;
        }

        mask <<= 1;
      }

      if (is_local_peak)
        peaks.push_back(static_cast<int>(i));
    }

    mutex_.lock();
    peaks_.insert(peaks_.end(), peaks.begin(), peaks.end());
    mutex_.unlock();
  }

private:
  const RVector<double> performance_;
  vector<int>& peaks_;
  const int N_;
  tthread::mutex mutex_;
};

// [[Rcpp::export]]
IntegerVector local_peaks(const NumericVector& performance, const int N) {
    vector<int> peaks;
    PeakFinder peak_finder(performance, peaks, N);
    // each thread call will check at least 1024 values
    parallelFor(0, performance.length(), peak_finder, 1024);

    IntegerVector result(peaks.size());
    int i = 0;
    for (int peak : peaks) {
        result[i++] = peak;
    }
    return result;
}

After saving the C++ code in local-peaks.cpp, this code:

library(Rcpp)
library(RcppParallel)

sourceCpp("local-peaks.cpp")

set.seed(1234L)
N <- 29L
performance <- runif(2^N)
system.time({
    local_peaks_int <- local_peaks(performance, N)
})

finished in ~2 seconds (without considering the time required to allocate performance).

If you do need the binary representation, you can change local_peaks like this (see this question):

// [[Rcpp::export]]
IntegerMatrix local_peaks(const NumericVector& performance, const int N) {
  vector<int> peaks;
  PeakFinder peak_finder(performance, peaks, N);
  // each thread call will check at least 1024 values
  parallelFor(0, performance.length(), peak_finder, 1024);

  // in case you want the same order every time, #include <algorithm> and uncomment next line
  // sort(peaks.begin(), peaks.end());

  IntegerMatrix result(peaks.size(), N);
  int i = 0;
  for (int peak : peaks) {
    for (int j = 0; j < N; j++) {
      result(i, N - j - 1) = peak & 1;
      peak >>= 1;
    }

    i++;
  }

  return result;
}
Alexis
  • 4,950
  • 1
  • 18
  • 37
  • Thank you, this is incredibly fast for N=20 and solves my problem. Once I get to N=30, even with parallel processing and 4-cores it seems to hit a limit again though. Because I need to run this over and over again for hundreds of iterations monte-carlo simulation, I am wondering if it is helpful to pre-index for each position all N neighbors, save this and then only have the performance checked each time needed since the logic of the decision space doesn't change. Basically separating the identification of neighboring positions and the performance check? Thanks again! – Dan Jun 23 '18 at 03:46
  • You can still avoid that. I've added a C++ version, but see the comments. – Alexis Jun 23 '18 at 12:38
  • Ooh I like it! Although I hope we don't start a Fortran vs C++ flame war, haha. I do find it curious that one fits in memory for you but the other doesn't. It's possible that since there's less interoperability with Fortran, the arrays are being passed by value and it needs to make duplicates. – Gabe Jun 23 '18 at 14:34
  • I've never used Fortran so I can't flame anyway =P. I also have 16GB RAM, so who knows... – Alexis Jun 23 '18 at 14:44
2

Here is one solution that follows the same general structure as your example code. intToBits and packBits map to and from the binary representation for each integer (subtracting one to start at zero). The inner loop flips each of the N bits to get the neighbors. On my laptop, this runs in a fraction of a second for N=10 and around a minute for N=20. The commented code stores some information from neighbors already tested so as to not redo the calculation. Uncommenting those lines makes it run in about 35 seconds for N=20.

loc_max <- rep(1, 2^N)
for (v in 1:2^N){
  ## if (loc_max[v] == 0) next
  vbits <- intToBits(v-1)
  for (q in 1:N){
    tmp <- vbits
    tmp[q] <- !vbits[q]
    pos <- packBits(tmp, type = "integer") + 1
    if (performance[pos] > performance[v]){
      loc_max[v] <- 0
      break
    ## } else {
    ##   loc_max[pos] <- 0
    }
  }
}

identical(loc_max, local_peaks_pos[, 1])
## [1] TRUE

EDIT: It sounds like you need every bit of speed possible, so here's another suggestion that relies on compiled code to run significantly faster than my first example. A fraction of a second for N=20 and a bit under 20 seconds for N=29 (the largest example I could fit in my laptop's RAM).

This is using a single core; you could either parallelize this, or alternatively run it in a single core and parallelize your Monte Carlo simulations instead.

library(inline)

loopcode <-
"  integer v, q, pos
   do v = 0, (2**N)-1
      do q = 0, N-1
         if ( btest(v,q) ) then
            pos = ibclr(v, q)
         else
            pos = ibset(v, q)
         end if
         if (performance(pos) > performance(v)) then
            loc_max(v) = 0
            exit
         end if
      end do
   end do
"

loopfun <- cfunction(sig = signature(performance="numeric", loc_max="integer", n="integer"),
                     dim=c("(0:(2**n-1))", "(0:(2**n-1))", ""),
                     loopcode,
                     language="F95")

N <- 20
performance = runif(2^N, min=0, max=1)
system.time({
  floop <- loopfun(performance, rep(1, 2^N), N)
})
##  user  system elapsed
## 0.049   0.003   0.052

N <- 29
performance = runif(2^N, min=0, max=1)
system.time({
  floop <- loopfun(performance, rep(1, 2^N), N)
})
##   user  system elapsed
## 17.892   1.848  19.741

I don't think pre-computing the neighbors would help much here since I'd guess the comparisons accessing different sections of such a large array are the most time consuming part.

Gabe
  • 649
  • 5
  • 6
  • Fortran's *fast*. Unfortunately I can't run your code with `N <- 29`, so either your laptop has more RAM than my desktop, or my Fortran compiler is doing things differently. – Alexis Jun 23 '18 at 12:54
  • Huh interesting... I have 16 GB of RAM, and `N=29` uses a bit over 15 of that for me. Since he was allocating a 2^NxN _matrix_ before, I figured he had RAM to spare. – Gabe Jun 23 '18 at 14:41
  • I should add that I'm running Linux so the footprint of the OS with nothing else running is pretty small. – Gabe Jun 23 '18 at 14:48
  • Me too, x64. I didn't even see a spike in RAM, it just determined immediately that it wouldn't be able to run it. – Alexis Jun 23 '18 at 14:53
  • Very odd. And I believe it should be using gfortran for both of us then, unless your compilation flags are different. – Gabe Jun 23 '18 at 15:02
  • Thank you! I indeed parallel-process the Monte Carlo runs on a cluster computer. Regarding the ram allocation, I am running out of RAM even on my 32GB laptop but have a few nodes with very large ram (756GB) available on the cluster. However, the performance values can be calculated on the fly when using unique seeds. So I am now rewriting the code so that the performance values do not need to be stored. Thanks again! – Dan Jun 24 '18 at 19:06