The best way to do this would be to use dask.delayed. In this case, you'd create a delayed function to read the array, and then compose a dask array from those delayed objects using the da.from_delayed function. Something along the lines of:
import dask
import dask.array as da
from osgeo import gdal

# This function isn't run until compute time
@dask.delayed(pure=True)
def load(file):
    return gdal.Open(file).ReadAsArray()

# Create several delayed objects, then turn each into a dask
# array. Note that you need to know the shape and dtype of each
# file
data = [da.from_delayed(load(f), shape=shape_of_f, dtype=dtype_of_f)
        for f in files]
x = da.stack(data, axis=0)
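If you don't know shape_of_f and dtype_of_f ahead of time, one possible way to pull them from the file metadata is sketched below. The shape_and_dtype helper is a hypothetical name of mine, and it assumes gdal.Open and gdal_array.GDALTypeCodeToNumericTypeCode behave as documented; opening the dataset just for metadata should be cheap compared to ReadAsArray.

from osgeo import gdal, gdal_array

def shape_and_dtype(file):
    # Open the file only to inspect its metadata, not to read pixel data
    ds = gdal.Open(file)
    if ds.RasterCount == 1:
        # Single band: ReadAsArray returns a 2-d (rows, cols) array
        shape = (ds.RasterYSize, ds.RasterXSize)
    else:
        # Multiple bands: ReadAsArray returns (bands, rows, cols)
        shape = (ds.RasterCount, ds.RasterYSize, ds.RasterXSize)
    # Map the GDAL type code of the first band to a numpy dtype
    dtype = gdal_array.GDALTypeCodeToNumericTypeCode(ds.GetRasterBand(1).DataType)
    return shape, dtype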
Note that this makes a single task for loading each file. If the individual files are large, you may want to chunk them yourself in the load function. I'm not familiar with gdal, but from a brief look at the ReadAsArray method this may be doable with the xoff/yoff/xsize/ysize parameters (not sure). You'd have to write this code yourself, but it may be more performant for large files.
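As a rough sketch of that idea (load_window and file_as_chunked_array are hypothetical helpers, and I'm assuming ReadAsArray accepts xoff, yoff, xsize, ysize in that order for a single-band raster):

import dask
import dask.array as da
from osgeo import gdal

@dask.delayed(pure=True)
def load_window(file, xoff, yoff, xsize, ysize):
    # Read only a window of the raster (offsets and sizes in pixels)
    return gdal.Open(file).ReadAsArray(xoff, yoff, xsize, ysize)

def file_as_chunked_array(file, full_shape, dtype, chunk=500):
    # full_shape is (rows, cols) for a single-band file
    ysize, xsize = full_shape
    rows = []
    for yoff in range(0, ysize, chunk):
        row = []
        for xoff in range(0, xsize, chunk):
            h = min(chunk, ysize - yoff)
            w = min(chunk, xsize - xoff)
            row.append(da.from_delayed(load_window(file, xoff, yoff, w, h),
                                       shape=(h, w), dtype=dtype))
        rows.append(row)
    # Assemble the grid of lazy blocks into one 2-d dask array
    return da.block(rows)

This gives each window its own task, so a single large file no longer has to be read in one go.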
Alternatively, you could use the code above and then call rechunk to split the result into smaller chunks. This would still read each file in a single task, but subsequent steps could work with smaller chunks. Whether this is worth it depends on the size of your individual files.
x = x.rechunk((500, 500, 500)) # or whatever chunks you want
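As a quick usage check (assuming x is built as above and is large enough for these hypothetical slice sizes), nothing is actually read from disk until a result is requested:

print(x.chunks)                   # inspect the chunk layout
subset = x[:, :100, :100].mean()  # still lazy, no files opened yet
result = subset.compute()         # files are opened and read here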