I am working with rather large arrays created from large image files. I was running out of memory and decided to try numpy.memmap arrays instead of the standard numpy.ndarray. I was able to create a memmap and load the data into it from my image file in chunks, but I'm not sure how to load the result of an operation into a memmap.

For example, my image files are read into numpy as binary integer arrays. I have written a function that buffers (expands) any region of True cells by a specified number of cells. This function converts the input array to Boolean using array.astype(bool). How would I make the new Boolean array created by array.astype(bool) a numpy.memmap array?

Also, if there is a True cell closer to the edge of the input array than the specified buffer distance, the function will add rows and/or columns to the edge of the array to allow for a complete buffer around the existing True cell. This changes the shape of the array. Is it possible to change the shape of a numpy.memmap?

Here is my code:

import re
from os import path

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

import readRaster  # local helper module for chunked GDAL reads

def getArray(dataset):
    '''Dataset is an instance of the GDALDataset class from the
    GDAL library for working with geospatial datasets

    '''
    chunks = readRaster.GetArrayParams(dataset, chunkSize=5000)
    datPath = re.sub(r'\.\w+$', '_temp.dat', dataset.GetDescription())
    pathExists = path.exists(datPath)
    # 'r+' requires an existing file; use 'w+' on the first run to create it
    arr = np.memmap(datPath, dtype=int, mode='r+' if pathExists else 'w+',
                    shape=(dataset.RasterYSize, dataset.RasterXSize))
    if not pathExists:
        for chunk in chunks:
            xOff, yOff, xWidth, yWidth = chunk
            chunkArr = readRaster.GetArray(dataset, *chunk)
            arr[yOff:yOff + yWidth, xOff:xOff + xWidth] = chunkArr
    return arr

def Buffer(arr, dist, ring=False, full=True):
    '''Applies a buffer to any non-zero raster cells'''
    arr = arr.astype(bool)
    nzY, nzX = np.nonzero(arr)
    minY = np.amin(nzY)
    maxY = np.amax(nzY)
    minX = np.amin(nzX)
    maxX = np.amax(nzX)
    if minY - dist < 0:
        arr = np.vstack((np.zeros((abs(minY - dist), arr.shape[1]), bool),
                         arr))
    if maxY + dist >= arr.shape[0]:
        arr = np.vstack((arr,
                         np.zeros(((maxY + dist - arr.shape[0] + 1), arr.shape[1]), bool)))
    if minX - dist < 0:
        arr = np.hstack((np.zeros((arr.shape[0], abs(minX - dist)), bool),
                         arr))
    if maxX + dist >= arr.shape[1]:
        arr = np.hstack((arr,
                         np.zeros((arr.shape[0], (maxX + dist - arr.shape[1] + 1)), bool)))
    buffOp = binary_dilation if dist >= 0 else binary_erosion
    bufDist = abs(dist) * 2 + 1
    k = np.ones((bufDist, bufDist))
    bufArr = buffOp(arr, k)
    return bufArr.astype(int)
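To show what the buffering is supposed to do (edge padding aside), here is a minimal self-contained example of the core operation; the `buffer_demo` helper and the 5x5 test array are made up purely for illustration:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def buffer_demo(arr, dist):
    # square kernel of side 2*|dist| + 1, as in Buffer above
    k = np.ones((abs(dist) * 2 + 1,) * 2)
    op = binary_dilation if dist >= 0 else binary_erosion
    return op(arr.astype(bool), k).astype(int)

a = np.zeros((5, 5), int)
a[2, 2] = 1                # a single True cell in the middle
out = buffer_demo(a, 1)    # grows it into a 3x3 block of ones
```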
  • [in this answer](http://stackoverflow.com/a/16597695/832621) it is explained how to deal with discontinuous data in memmaps, maybe it will help you out... – Saullo G. P. Castro Sep 30 '13 at 16:15
  • @SaulloCastro - Thanks for the link. I'm not really sure how that applies here. I don't consider my situation to be related to discontinuous data, but I am new to numpy so I may be wrong. – Brian Sep 30 '13 at 16:25
  • I posted the link to that question since it gives some information about how to work with offsets into a `memmap`, in order to access a given data block, which will be needed in your case. – Saullo G. P. Castro Sep 30 '13 at 18:44
  • So are you suggesting that I save the outputs of each operation to the same memmap but with a different offset? – Brian Sep 30 '13 at 18:45
  • Yes, you can even access the same memmap simultaneously using different processes running in parallel – Saullo G. P. Castro Sep 30 '13 at 18:47
  • how large is large? and what is the end-goal you want to reach with your code? – usethedeathstar Oct 01 '13 at 07:36
  • @usethedeathstar - The datasets being converted into numpy arrays are geographic datasets that could cover several hundred sq miles where each pixel is 1 sq meter. My code is meant to create a buffer around pixels representing certain features (e.g. schools, fire stations) and then overlay those buffers on other features (e.g. forests). – Brian Oct 01 '13 at 14:32
  • @Brian I guess in that case just buying another 32 GB of RAM will not be enough for the job. Just thinking out loud here: isn't this the type of application you would use pycuda or something else for GPU programming for? (just to speed up what you are doing on each separate small piece of the big image) – usethedeathstar Oct 01 '13 at 15:06
  • @usethedeathstar - gpu programming could be an option. It's not something I have any experience with. Although speed is important, at this point it isn't the issue. Right now, my script is using 99% of the physical memory on my machine just opening all the datasets. `memmaps` helped get a little further, but many of the operations create a new array and I'm not sure how to save the new arrays into a `memmap`. I am also trying to split the datasets into chunks for processing, but because the shape of the data may change, it is difficult to avoid edge effects. – Brian Oct 01 '13 at 15:19
  • @Brian if you just make sure your chunks overlap enough, you don't have that problem I guess, but then the problem becomes having to execute the same piece of code over and over again. That can be done in parallel (on the same machine or on multiple machines, depending on the size of the problem). Just let each thread read in a different chunk of the data, and abuse shortcuts like += and so on to overwrite the same memory spaces, since at a certain moment the memory layout becomes the bottleneck. How many GB is one datafile? – usethedeathstar Oct 01 '13 at 15:28

1 Answer


Let me try to answer the first part of your question: loading a result into a memmap datastore.

Note: I am going to assume that there is already a memmap file on disk to serve as the input file, called MemmapInput and created as follows:

fpInput = np.memmap('MemmapInput', dtype='bool', mode='w+', shape=(3,4))
del fpInput
fpOutput = np.memmap('MemmapOutput', dtype='bool', mode='w+', shape=(3,4))
del fpOutput

In your case the output file might not be present. Per the docs:

  • 'r+' Open existing file for reading and writing.
  • 'w+' Create or overwrite existing file for reading and writing.

So the first time you create a memmap file it must be with 'w+'; thereafter, to modify/overwrite the file, use 'r+'. Read-only copies can be obtained with 'r'. See http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html for more info.
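A minimal sketch of those mode semantics (the file name here is a throwaway temp path, not anything from your setup):

```python
import os
import tempfile
import numpy as np

fname = os.path.join(tempfile.mkdtemp(), 'modes.dat')

# 'w+' creates the backing file (and would truncate an existing one)
fp = np.memmap(fname, dtype='bool', mode='w+', shape=(3, 4))
fp[0, 0] = True
fp.flush()
del fp    # closing drops the mapping; the data stays on disk

# 'r+' reopens the existing file for in-place reads and writes
fp = np.memmap(fname, dtype='bool', mode='r+', shape=(3, 4))
```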

Now we will read in this file and perform some operations on it. The main point is: to load a result into a memmap file, the memmap must first be created and attached to a file.

fpInput = np.memmap('MemmapInput', dtype='bool', mode='r', shape=(3,4))
fpOutput = np.memmap('MemmapOutput', dtype='bool', mode='r+', shape=(3,4))

Do whatever you want with the fpOutput memmap file e.g.:

i, j = np.nonzero(fpInput)
# walk the (row, col) pairs of True cells together with zip -- looping
# over i and j separately would visit the full cross product instead
for indexI, indexJ in zip(i, j):
  fpOutput[indexI-1, indexJ] = True
  fpOutput[indexI, indexJ-1] = True
  fpOutput[indexI+1, indexJ] = True
  fpOutput[indexI, indexJ+1] = True
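To tie this back to the first part of the question: the result of astype(bool) can be landed in a memmap the same way, by preallocating an output memmap of the right shape and assigning the converted chunks into slices of it. A sketch with throwaway temp files and a toy 6x8 array standing in for your raster:

```python
import os
import tempfile
import numpy as np

d = tempfile.mkdtemp()
src = np.memmap(os.path.join(d, 'in.dat'), dtype=int, mode='w+', shape=(6, 8))
src[:] = np.arange(48).reshape(6, 8) % 3
dst = np.memmap(os.path.join(d, 'out.dat'), dtype=bool, mode='w+',
                shape=src.shape)
for row in range(0, src.shape[0], 2):
    # astype builds a small in-memory chunk; assigning it to a slice of
    # dst copies the result straight into the memory-mapped file
    dst[row:row + 2] = src[row:row + 2].astype(bool)
dst.flush()
```

Only one chunk of the conversion lives in RAM at a time, so this keeps the peak memory use at the chunk size rather than the full array.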
Paul