4

I am using the mosaic function in the raster package to combine a long (11,000 files) list of rasters using the approach suggested by @RobertH here.

rlist <- sapply(list_names)
rlist$fun <- mean
rlist$na.rm <- TRUE
x <- do.call(mosaic, rlist)

As you might imagine, this eventually overruns my available memory (on several different machines and computing clusters). My question is: Is there a way to reduce the memory usage of either mosaic or do.call? I've tried altering maxmemory in rasterOptions(), but that does not seem to help. Processing the rasters in smaller batches seems problematic because the rasters may be spatially disjunct (i.e., sequential raster files may be located very far from each other). Thanks in advance for any help you can give.

Matt Williamson
  • 117
  • 2
  • 11
  • Can you do this in geographically contiguous batches? – Robert Hijmans Mar 11 '18 at 07:05
  • @RobertH, I think so (assuming I can sort `rlist` using the extent object). Won't that still require me to mosaic these intermediate rasters together (and continue my memory problem)? – Matt Williamson Mar 11 '18 at 14:57
  • I never tried with that many files, but maybe you could try building a gdal vrt file using `gdalUtils::gdalbuildvrt` and then work from there. https://www.rdocumentation.org/packages/gdalUtils/versions/2.0.1.7/topics/gdalbuildvrt. – lbusett Mar 11 '18 at 20:22
  • I do not know if doing this in steps would help. There should not be a memory problem to begin with... Also did you try `rasterOptions(todisk=TRUE)` – Robert Hijmans Mar 12 '18 at 04:58
  • The virtual file idea is interesting. You could make a vrt and then use writeRaster (except that this would be equivalent to `merge`, not `mosaic`) – Robert Hijmans Mar 12 '18 at 05:00
  • @RobertH, I did not have `rasterOptions(todisk=TRUE)` set originally, but tried again with the `do.call(mosaic, rlist)`. That did not solve the problem. – Matt Williamson Mar 12 '18 at 15:18
  • @lbusett, thank you for the suggestion. I've not used `gdalUtils`, but need to be able to sum the overlaps (rather than merge them) so perhaps this solution will not work for me (based on @RobertH's comment) – Matt Williamson Mar 12 '18 at 15:20

1 Answers1

8

Rather than loading all rasters into memory at once (in the mosaic() call), can you process them one at a time? That way, you have your mosaic that updates each time you bring one more raster into memory, but then you can get rid of the new raster and just keep the continuously updating mosaic raster.

Assuming that your rlist object is a list of rasters, I'm thinking of something like:

Pseudocode

  1. Initialize an updating_raster object as the first raster in the list
  2. Loop through each raster in the list in turn, starting from the 2nd raster
  3. Read the ith raster into memory called next_raster
  4. Update the updating_raster object by overwriting it with the mosaic of itself and the next raster using a weighted mean

R code

Testing with the code in the mosaic() help file example...

First generate some rasters and use the standard mosaic method.

library(raster)

r <- raster(ncol=100, nrow=100)
r1 <- crop(r, extent(-10, 11, -10, 11))
r2 <- crop(r, extent(0, 20, 0, 20))
r3 <- crop(r, extent(9, 30, 9, 30))

r1[] <- 1:ncell(r1)
r2[] <- 1:ncell(r2)
r3[] <- 1:ncell(r3)

m1 <- mosaic(r1, r2, r3, fun=mean)

Put the rasters in a list so they are in a similar format as I think you have.

rlist <- list(r1, r2, r3)

Because of the NA handling of the weighted.mean() function, I opted to create the same effect by breaking down the summation and the division into distinct steps...

First initialize the summation raster:

updating_sum_raster <- rlist[[1]]

Then initialize the "counter" raster. This will represent the number of rasters that went into mosaicking at each pixel. It starts as a 1 in all cells that aren't NA. It should properly handle NAs such that it only will increment for a given pixel if a non-NA value was added to the updating sum.

updating_counter_raster <- updating_sum_raster
updating_counter_raster[!is.na(updating_counter_raster)] <- 1

Here's the loop that doesn't require all rasters to be in memory at once. The counter raster for the raster being added to the mosaic has a value of 1 only in the cells that aren't NA. The counter is updated by summing the current counter raster and the updating counter raster. The total sum is updated by summing the current raster values and the updating raster values.

for (i in 2:length(rlist)) {

  next_sum_raster <- rlist[[i]]
  next_counter_raster <- next_sum_raster
  next_counter_raster[!is.na(next_counter_raster)] <- 1

  updating_sum_raster <- mosaic(x = updating_sum_raster, y = next_sum_raster, fun = sum)
  updating_counter_raster <- mosaic(updating_counter_raster, next_counter_raster, fun = sum)

}

m2 <- updating_sum_raster / updating_counter_raster

The values here seem to match the use of the mosaic() function

identical(values(m1), values(m2))
> TRUE

But the rasters themselves aren't identical:

identical(m1, m2)
> FALSE

Not totally sure why, but maybe this gets you closer?

Perhaps compareRaster() is a better way to check:

compareRaster(m1, m2)
> TRUE

Hooray!

Here's a plot!

plot(m1)
text(m1, digits = 2)
plot(m2)
text(m2, digits = 2)

comparison between mosaic function and loop approach

A bit more digging in the weeds...

From the mosaic.R file:

It looks like the mosaic() function initializes a matrix called v to populate with the values from all the cells in all the rasters in the list. The number of rows in matrix v is the number of cells in the output raster (based on the full mosaicked extent and resolution), and the number of columns is the number of rasters to be mosaicked (11,000) in your case. Maybe you're running into the limits of matrix creation in R?

With a 1000 x 1000 raster (1e6 pixels), the v matrix of NAs takes up 41 GB. How big do you expect your final mosaicked raster to be?

r <- raster(ncol=1e3, nrow=1e3)
x <- 11000
v <- matrix(NA, nrow=ncell(r), ncol=x)
format(object.size(v), units = "GB")
[1] "41 Gb"
mikoontz
  • 592
  • 5
  • 11
  • 1
    This works! The trick seems to be: a) reading only 1 raster at a time without keeping every raster in `rlist` in memory, b) limiting the number of NAs added to the mosaic, and c) using `rasterOptions(todisk=TRUE)`. Could possibly speed it up by creating a custom function and using `lapply`, but at this point, I'll stick with what works. Thanks @mikoontz for a very thorough answer. – Matt Williamson Mar 13 '18 at 05:19