
I would like to rasterize a very large vector file to 25 m resolution and have had some success with the 'cluster' package, adapting the questions here and here, which worked nicely for that particular data.

However, I now have a larger vector file that needs rasterizing, and I have access to a cluster that uses snowfall. I'm not used to cluster functions and I'm just not sure how to set up sfLapply. I consistently get the following sort of error when sfLapply is called on the cluster:

Error in checkForRemoteErrors(val) : 
  one node produced an error: 'quote(96)' is not a function, character or symbol
Calls: sfLapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors

My full code:

library(snowfall)
library(rgeos)
library(maptools)
library(raster)
library(sp)

setwd("/home/dir/")

# Initialise the cluster...
hosts = as.character(read.table(Sys.getenv('PBS_NODEFILE'),header=FALSE)[,1]) # read the nodes to use
sfSetMaxCPUs(length(hosts)) # make sure the maximum allowed number of CPUs matches the number of hosts
sfInit(parallel=TRUE, type="SOCK", socketHosts=hosts, cpus=length(hosts), useRscript=TRUE) # initialise a socket cluster session with the named nodes
sfLibrary(snowfall)

# read in required data

shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
crs(shp) <- BNG

### rasterize the uniques to 25m and write (GB and clipped) ###
rw <- raster(res=c(25,25), xmn=0, xmx=600000, ymn=0, ymx=1000000, crs=BNG)

# Number of polygons features in SPDF
features <- 1:nrow(shp[,])

# Split features in n parts
n <- 96
parts <- split(features, cut(features, n))

rasFunction = function(X, shape, raster, nparts){
    ras = rasterize(shape[nparts[[X]],], raster, 'CODE')
    return(ras)
}

# Export everything in the workspace onto the cluster...
sfExportAll()

# Distribute calculation across the cluster nodes...
rDis = sfLapply(n, fun=rasFunction,X=n, shape=shp, raster=rw, nparts=parts) # equivalent of sapply
rMerge <- do.call(merge, rDis)

writeRaster(rMerge, filename="my_data_25m",  format="GTiff", overwrite=TRUE)

# Stop the cluster...
sfStop()

I've tried a number of things, changing the function and the sfLapply call, but I just can't get this to run. Thanks.

Sam
    If you're looking for speed with rasterizing (large) vector data, look at `gdalUtils::gdal_rasterize`. This is usually much faster than `raster::rasterize`. – joberlin Jan 31 '17 at 21:19
  • OK, thank you, I'll look at that as well – Sam Feb 01 '17 at 08:58
  • I dropped rasFunction and changed rDis to "rDis = sfLapply(1:n, fun=function(x) rasterize(shp[parts[[x]],], rw, 'CODE'))", but now I get Error in checkForRemoteErrors(val) : 96 nodes produced errors; first error: 'data' must be of a vector type, was 'NULL'. Stumped. – Sam Feb 01 '17 at 09:16
  • @joberlin Ooooh. I've been looking for a way to speed up vector -> raster operations... – SeldomSeenSlim Feb 01 '17 at 16:25
  • I'm having good first impressions of gdalUtils::gdal_rasterize, will update this week – Sam Feb 01 '17 at 16:31
  • OK, so maybe I'm missing something here, but you know that just changing crs(something) <- crs(somethingElse) doesn't actually reproject the data, right? All it does is overwrite the CRS in the file header without doing anything to reproject the data within the .shp. I'm not super familiar with your data; maybe it's projected correctly outright and you just need to define it. But if your first n part is coming up NULL (because there is nothing there to rasterize), could it be because it's empty and the .shp isn't correctly projected to your raster? – SeldomSeenSlim Feb 01 '17 at 16:34
  • @Aron yes, I know; the projection was undefined to begin with, but all ideas welcome, sometimes it is something like this that you just don't consider! – Sam Feb 02 '17 at 08:40

2 Answers


Because I can't do formatting in comments:

library(maptools)
library(rgdal) # supplies the spTransform methods for Spatial* objects

shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"

# spTransform expects a CRS object, not a bare proj4 string,
# and the source CRS of shp must already be defined
shp.2 <- spTransform(shp, CRS(BNG))
# Continue as before

Overwriting a projection != reprojecting data.
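To make this concrete, here is a tiny illustration (hypothetical: it assumes shp genuinely holds WGS84 long/lat coordinates; compare the bounding boxes):

library(sp)
library(rgdal)

# suppose the coordinates really are WGS84 long/lat
proj4string(shp) <- CRS("+proj=longlat +datum=WGS84")

# overwriting the CRS label leaves the coordinates untouched
shp.mislabelled <- shp
proj4string(shp.mislabelled) <- CRS(BNG)
bbox(shp.mislabelled) # still degree-sized numbers, now claiming to be metres

# spTransform actually recomputes the coordinates
shp.reprojected <- spTransform(shp, CRS(BNG))
bbox(shp.reprojected) # six-figure BNG eastings/northings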

SeldomSeenSlim
  • no sorry, I meant all ideas welcome regarding why the script might error!! I know the difference between undefined and reprojecting :) – Sam Feb 03 '17 at 08:35

OK, so I abandoned snowfall and looked into gdalUtils::gdal_rasterize instead, and I found a lot of benefits to using it (with one downside that someone might be able to answer?).

Context & Issue: My vector data exist inside an ESRI File Geodatabase and require some processing pre-rasterization. No problem, rgdal::readOGR is fine. However, as gdal_rasterize requires a path name to the vector data, I had trouble here because I could not write out my processed vector data: they exceed the maximum file size for a shapefile outside of a geodatabase, and gdal_rasterize will not accept R objects, paths to .gdbs, or .Rdata/.rds files. How do I pass an object to gdal_rasterize??
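(An aside for anyone hitting the same wall: one possible workaround, which I have not tested against this data, is to write the processed object out to a GeoPackage, which has no shapefile-style size limit, and point gdal_rasterize at that file. The layer name mydata below is illustrative.)

library(rgdal)
library(gdalUtils)

# write the in-memory object to a GeoPackage; GPKG avoids the shapefile size limit
writeOGR(shape, dsn="processed.gpkg", layer="mydata", driver="GPKG")

# rasterize straight from the GeoPackage; l= names the layer inside the file
gdal_rasterize(src_datasource="processed.gpkg", dst_filename="out.tif",
               l="mydata", a="code", tr=c(25,25))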

So, instead, I wrote out the large shapefile in segments equal to the number of processors.

Originally, raster::rasterize was used, as I could simply pass the vector object stored in memory to rasterize without the writing problem (though I would have liked to have it written), rasterizing the data to 25 m. This took a pretty long time, even in parallel.

Solution: gdal_rasterize in parallel.

# gdal_rasterize in parallel
require(gdalUtils)
require(rgdal)
require(rgeos)
require(parallel) # provides makeCluster/parLapply
require(raster)

# read in vector data
shape <- readOGR("./mygdb.gdb", layer="mydata",stringsAsFactors=F)

## do all the vector processing etc ##

# split vector data into n parts, the same as number of processors (minus 1)
npar <- detectCores() - 1
features <- 1:nrow(shape)
parts <- split(features, cut(features, npar))

# write the vector parts out
for(n in 1:npar){
  writeOGR(shape[parts[[n]],], ".\\parts", paste0("mydata_p",n), driver="ESRI Shapefile")
}

# set up and write a blank raster for gdal_rasterize for EACH vector segment created above
r <- raster(res=c(25,25), xmn=234000, xmx=261000, ymn=229000, ymx=256000, crs=projection(shape))
for(n in 1:npar){
  writeRaster(r, filename=paste0(".\\gdal_p",n,".tif"), format="GTiff", overwrite=TRUE)
}

# set up cluster and pass required packages and objects to cluster
cl <- makeCluster(npar)
clusterEvalQ(cl, sapply(c('raster', 'gdalUtils',"rgdal"), require, char=TRUE))
clusterExport(cl, list("r","npar"))

# parallel apply the gdal_rasterize function against the vector parts that were written,
# same number as processors, against the pre-prepared rasters.
# NB: index the file names with x (the worker's part number), not n,
# so that each worker writes to its own raster
parLapply(cl = cl, X = 1:npar,
          fun = function(x) gdal_rasterize(src_datasource=paste0(".\\parts\\mydata_p",x,".shp"),
                                           dst_filename=paste0(".\\gdal_p",x,".tif"),
                                           b=1, a="code", verbose=F, output_Raster=T))

# There are now n rasters representing the n segments of the original vector file
# read in the rasters as a list, merge and write to a new tif. 
s <- lapply(X=1:npar, function(x) raster(paste0(".\\gdal_p",x,".tif"))) # again index with x, not n
s$filename <- "myras_final.tif"
do.call(merge, s)
stopCluster(cl)

For the entire job, this code (split roughly 60% vector reading/processing/writing and 40% raster generation and rasterization) was around 9 times quicker than raster::rasterize in parallel.

Note: I tried this initially by splitting the vectors into n parts but creating only one blank raster. All cluster nodes then wrote to the same blank raster simultaneously, but this corrupted the raster and made it unusable in R/Arc/anything (despite the function completing without error). The above is a more stable way, but n blank rasters have to be made instead of one, increasing processing time, plus merging n rasters is extra processing.

Caveat: the parallel raster::rasterize run did not have writeRaster inside the rasterize call (via its filename argument) but rather as a separate line, which will have increased processing time in the original run due to storage to temp files etc.
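For reference, rasterize can write each chunk straight to disk through its filename argument, avoiding the separate writeRaster step (a sketch, reusing the objects above):

# rasterize a part and write it to disk in one call
ras <- rasterize(shape[parts[[1]],], r, field="code",
                 filename="part_1.tif", format="GTiff", overwrite=TRUE)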

EDIT: Why are the frequency tables from the gdal_rasterize raster not the same as those from raster::rasterize? I mean, with 100 million cells I expect a bit of difference, but for some codes it was a few thousand cells different. I thought they both rasterized by centroid?
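A minimal way to compare the two outputs, assuming both rasters are on disk (file names illustrative):

library(raster)

r_gdal <- raster("myras_final.tif") # output from gdal_rasterize
r_rast <- raster("myras_rasterize.tif") # hypothetical output from raster::rasterize

# cell counts per code value in each raster
f1 <- as.data.frame(freq(r_gdal))
f2 <- as.data.frame(freq(r_rast))

# join on the code value to see the per-code differences
cmp <- merge(f1, f2, by="value", suffixes=c(".gdal", ".rasterize"))
cmp$diff <- cmp$count.gdal - cmp$count.rasterize
cmp[order(-abs(cmp$diff)), ]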

Sam