
I have an image with a large number of bands, and I want to perform some operation on each band and write the result to a new image. To speed things up, I want to use multiprocessing so that every process works on one image band, but I get a pickling error (TypeError: cannot pickle 'SwigPyObject' object) and now I don't know if my idea is even possible.

Here is my code:

import os
import time
import itertools
import multiprocessing as mp
import numpy as np
from osgeo import gdal


def worker_function(src, band, dst):
    data = src.GetRasterBand(band).ReadAsArray()
    # do more here later
    print('\t ... wrrrrrrrrmmmmmm ...')
    dst.GetRasterBand(band).WriteArray(data)
    return


def wrapper(params):
    return worker_function(*params)


if __name__ == '__main__':
    start = time.time()
    cores = 3
    img_in = r'c:\Users\myname\input.tif'
    img_out = r'c:\Users\myname\output.tif'
    ### whole bunch of necessary raster creation things here ###
    if os.path.exists(img_out):
        os.remove(img_out)
    ds = gdal.Open(img_in, gdal.GA_ReadOnly)
    print(f'Raster Size: {ds.RasterXSize, ds.RasterYSize}')
    print(f'Number of bands: {ds.RasterCount}')
    print('Creating output file ...')
    driver = gdal.GetDriverByName('GTiff')
    out = driver.Create(img_out, ds.RasterXSize, ds.RasterYSize, ds.RasterCount, gdal.GDT_Byte)
    out.SetProjection(ds.GetProjection())
    out.SetGeoTransform(ds.GetGeoTransform())
    print(f'Setting up multiprocessing for {cores} cores ...')
    mp.freeze_support()
    print('Starting the pool ...')
    ### this is the important part ###
    with mp.Pool(processes=cores) as pool:
        pool.map(wrapper, zip(itertools.repeat(ds), [b for b in range(1, ds.RasterCount + 1)], itertools.repeat(out)))
    out = None
    ds = None
    print(f'\nDone! Took me {np.round(time.time() - start, 4)}s.')
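
As far as I can tell, the dataset handle itself is what cannot be serialized; a plain pickle call fails with the same TypeError:

import pickle
from osgeo import gdal

ds = gdal.Open(r'c:\Users\myname\input.tif', gdal.GA_ReadOnly)
pickle.dumps(ds)  # TypeError: cannot pickle 'SwigPyObject' object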

Since ds and out are Swig Objects of type 'GDALDatasetShadow *', I get that pickling error. Following the answer to Python multiprocessing PicklingError: Can't pickle <type 'function'>, I tried replacing pool.map with pool.apply_async:

pool.apply_async(worker_function, zip(itertools.repeat(ds), [b for b in range(1, ds.RasterCount + 1)], itertools.repeat(out)))

But then I only get an empty result image, and it finishes in under half a second with no prints at all, which indicates to me that worker_function is never even entered. What am I missing here?
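
As far as I can tell, apply_async only schedules a single call, and any exception from that call is stored on the returned AsyncResult and only re-raised once .get() is called, so the failure may simply be swallowed. A minimal check, reusing the variables from above:

with mp.Pool(processes=cores) as pool:
    result = pool.apply_async(worker_function, (ds, 1, out))
    result.get()  # should re-raise the swallowed exception, e.g. the pickling TypeError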

s6hebern
  • This is probably not a good idea. You can read in parallel, but it probably still requires opening a separate read-only Dataset on each worker (or thread). I don't think writing in parallel is even supported by GDAL; in my experience, writing is also rarely the bottleneck. – Rutger Kassies Oct 17 '22 at 06:42
  • Is it just a bad idea or is it simply not possible, from a technical standpoint? – s6hebern Oct 17 '22 at 08:55
  • I think you should at least open and close the Dataset on each worker, instead of using a single instance. For reading only, using a single Dataset _might_ work if you use multi-threading. This recent MR might also be interesting for you, if you're willing to wait a bit until it's released: https://github.com/OSGeo/gdal/pull/6438 – Rutger Kassies Oct 17 '22 at 11:26
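
A minimal sketch of the pattern described in the comments (untested; process_band and the hard-coded paths are placeholders): each worker opens its own read-only Dataset, so only the band number has to be pickled, and all writing stays in the parent process:

import multiprocessing as mp
from osgeo import gdal

IMG_IN = r'c:\Users\myname\input.tif'

def process_band(band):
    # Each worker opens (and closes) its own read-only handle.
    src = gdal.Open(IMG_IN, gdal.GA_ReadOnly)
    data = src.GetRasterBand(band).ReadAsArray()
    # ... per-band processing here ...
    src = None
    return band, data

if __name__ == '__main__':
    src = gdal.Open(IMG_IN, gdal.GA_ReadOnly)
    driver = gdal.GetDriverByName('GTiff')
    out = driver.Create(r'c:\Users\myname\output.tif', src.RasterXSize,
                        src.RasterYSize, src.RasterCount, gdal.GDT_Byte)
    out.SetProjection(src.GetProjection())
    out.SetGeoTransform(src.GetGeoTransform())
    bands = range(1, src.RasterCount + 1)
    src = None
    with mp.Pool(processes=3) as pool:
        # Reads run in parallel; all writes happen in this one process.
        for band, data in pool.imap_unordered(process_band, bands):
            out.GetRasterBand(band).WriteArray(data)
    out = None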

0 Answers