
How can I efficiently compare pairs of distribution rasters (raster layers containing only 0 and 1)? I need a measure of similarity among ~6500 individual global rasters. `Istat` from `SDMTools` should do the job.

Here is my code:

library(raster)
library(SDMTools)

Create reproducible example data: rasters with values of 0 and 1

# first raster
r1 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=0)
r2 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=0, ymn=0, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=2)
r12 <- mosaic(r1, r2, fun=mean)

# second raster
r3 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=0)
r4 <- raster(nrow=1800, ncol=3600, xmn=-12000000, xmx=15000000, ymn=2000000, ymx=3000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=2)
r34 <- mosaic(r3, r4, fun=mean)

List rasters

files_list <- list(r12, r34)

Create empty matrix to fill with data from loop

ras_comp <- matrix(NA, nrow=length(files_list), ncol=length(files_list))
ras_comp
# label rows and columns of matrix
rownames(ras_comp) <- c("r12", "r34")
colnames(ras_comp) <- c("r12", "r34")
ras_comp

Loop to compare all possible pairs of matrices/rasters

for (i in 1:length(files_list)) {
  # convert raster i to a matrix
  ras_i <- as.matrix(files_list[[i]])

  for (j in 1:length(files_list)) {
    # convert raster j to a matrix
    ras_j <- as.matrix(files_list[[j]])

    # compare both rasters
    ras_Istat <- Istat(ras_i, ras_j, old=FALSE)

    # write value into matrix
    ras_comp[i,j] <- ras_Istat
  }
}

Check final matrix

> ras_comp
          r12       r34
r12 1.0000000 0.1814437
r34 0.1814437 1.0000000

Converting the rasters to matrices with `as.matrix` reduces the calculation time significantly, and the resulting final table is what I need, but doing this for thousands of rasters takes forever to complete. How can the code be optimized to compare the rasters more efficiently?

  • For one thing, this will always be a symmetric matrix and the diagonal will always be 1, so you only need to calculate the upper triangle, i.e. loop over `i in 1:(N-1)` and `j in (i+1):N`. This will cut time by more than half. But maybe half of "forever" is still too long. – dww Jan 28 '20 at 02:28
  • If you can be sure that all the values in all your rasters pass `is.finite` then there's a whole load of optimisations you can make. `Istat` tests both args with `is.finite` and then computes on cells where values are finite in both rasters. If you know both are finite beforehand you can compute `pr=r/sum(r)` *once* for each raster and then you only need to compute `1 - (sum((sqrt(px) - sqrt(py))^2))/2` over pairs of those to get the statistic. Otherwise each call to `Istat` has to redo the testing and scaling, which takes 90% of the time `Istat` runs. – Spacedman Jan 28 '20 at 10:58
  • @dww, that's a good point. I changed the code to `i in 1:(N-1)` and `j in (i+1):N` to only calculate the upper triangle (sketched below). @Spacedman, all rasters are finite. Thanks for the explanation! – SophiaL Jan 29 '20 at 22:31
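
To make the upper-triangle change concrete, here is a minimal sketch against the question's `files_list` and `ras_comp` (the diagonal is known to be 1, and each off-diagonal value is mirrored rather than recomputed):

N <- length(files_list)
diag(ras_comp) <- 1  # Istat of a raster with itself is always 1
for (i in 1:(N-1)) {
  ras_i <- as.matrix(files_list[[i]])  # convert raster i once per outer iteration
  for (j in (i+1):N) {
    ras_j <- as.matrix(files_list[[j]])
    val <- Istat(ras_i, ras_j, old=FALSE)
    ras_comp[i,j] <- val  # upper triangle
    ras_comp[j,i] <- val  # mirror into the lower triangle
  }
}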

1 Answer


`Istat` does a bunch of tests and scalings before a fairly simple computation. If you know those tests will pass, you can do the scalings as a one-off and work on the scaled values. It does:

if (length(which(dim(x) == dim(y))) != 2) 
    stop("matrix / raster objects must be of the same extent")
if (min(c(x, y), na.rm = T) < 0) 
    stop("all values must be positive")

Then it checks where both rasters are "finite", which also filters out `NA` values:

pos = which(is.finite(x) & is.finite(y))
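
For reference, `is.finite` is `FALSE` for `NA` and `NaN` as well as for infinities, so any cell that is missing in either raster is dropped from the comparison:

> is.finite(c(1, 0, NA, NaN, Inf))
[1]  TRUE  TRUE FALSE FALSE FALSE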

It then computes the scaled values of the rasters:

px = x[pos]/sum(x[pos])
py = y[pos]/sum(y[pos])
H = sqrt(sum((sqrt(px) - sqrt(py))^2))

If `old=FALSE`, as you have it, then it returns:

    return(1 - (H^2)/2)

> Istat(r12,r34)
[1] 0.1814437

Since `H` is only ever used as `H^2`, the square root is unnecessary. If I cut out the tests and write a function that works on pre-scaled values, I can get it down to:

fIstat = function(px, py){
    # px and py must already be scaled to sum to 1
    1 - (sum((sqrt(px) - sqrt(py))^2))/2
}

Test by scaling the rasters and running:

r12px = r12[]/sum(r12[])
r34px = r34[]/sum(r34[])
fIstat(r12px, r34px)
# [1] 0.1814437

Same value. Great, but is it quicker?

> library(microbenchmark)
> microbenchmark(fIstat(r12px, r34px), Istat(r12, r34))
Unit: milliseconds
                 expr        min         lq       mean     median         uq       max neval
 fIstat(r12px, r34px)   49.95867   78.28649   78.10863   79.45235   80.85234  106.6803   100
      Istat(r12, r34) 1084.84825 1181.31116 1217.64122 1212.93180 1263.50811 1349.0239   100

Yes, by a large factor (about 15x on the medians).

So... if your data have no missing values or infinities: create a `files_list` of these scaled raster values, call my `fIstat`, loop only over the upper triangle, and you should speed this up by a factor of 10.
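
As a minimal sketch of that recipe (assuming every raster passes `is.finite` everywhere and has a positive sum; `scaled_list`, `n` and `I_mat` are illustrative names, not part of the answer above):

# scale each raster's values once so they sum to 1
scaled_list <- lapply(files_list, function(r) r[]/sum(r[]))

n <- length(scaled_list)
I_mat <- matrix(NA_real_, nrow=n, ncol=n)
diag(I_mat) <- 1  # a raster compared with itself always gives 1

# compute only the upper triangle and mirror it
for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    I_mat[i,j] <- fIstat(scaled_list[[i]], scaled_list[[j]])
    I_mat[j,i] <- I_mat[i,j]
  }
}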

  • This detailed answer is very helpful - thanks a lot! I suspect you got the information on how the function actually works from the respective paper? Or where else can this be found? The calculation time is still massive, but some parallel processing might be the way to go in this case... – SophiaL Jan 29 '20 at 22:39
  • I looked at the source code for `Istat` - do `library(SDMTools)` and then type `Istat` at the prompt and it prints it out. – Spacedman Jan 29 '20 at 22:58