I have written a Python script for tiling imagery using the GDAL open source library and the command line utilities provided with that library. First, I read an input dataset that tells me each tile extent. Then, I loop through the tiles and start a subprocess to call gdalwarp in order to clip the input image to the current tile in the loop.
I don't want to use Popen.wait() because that would keep the tiles from being processed concurrently, but I do want to keep track of any messages returned by each subprocess. In addition, once a particular tile has been created, I need to calculate the statistics for the new file using gdalinfo, which requires another subprocess.
Here is the code:
    processing = {}
    for tile in tileNums:
        subp = subprocess.Popen(['gdalwarp', '-ot', 'Int16', '-r', 'cubic', '-of', 'HFA',
                                 '-cutline', tileIndexShp,
                                 '-cl', os.path.splitext(os.path.basename(tileIndexShp))[0],
                                 '-cwhere', "%s = '%s'" % (tileNumField, tile),
                                 '-crop_to_cutline',
                                 os.path.join(inputTileDir, 'mosaic_Proj.vrt'),
                                 os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_', tile))],
                                stdout=subprocess.PIPE)
        processing[tile] = [subp]

    while processing:
        for tile, subps in processing.items():
            for idx, subp in enumerate(subps):
                if subp is None: continue
                poll = subp.poll()
                if poll is None: continue
                elif poll != 0:
                    subps[idx] = None
                    print tile, "%s Unsuccessful" % ("Retile" if idx == 0 else "Statistics")
                else:
                    subps[idx] = None
                    print tile, "%s Succeeded" % ("Retile" if idx == 0 else "Statistics")
                    if subps == [None, None]:
                        del processing[tile]
                        continue
                    subps.append(subprocess.Popen(['gdalinfo', '-stats',
                                                   os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_', tile))],
                                                  stdout=subprocess.PIPE))
For the most part, this works for me, but the one issue I am seeing is that it seems to create an infinite loop when it gets to the last tile. I know this is not the best way to do this, but I am very new to the subprocess module and I basically just threw this together to try and get it to work.
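For reference, two things in the loop above could plausibly keep it from ever finishing. A tile is only removed once its list equals [None, None], so a tile whose retile step fails appears to stay in the dictionary forever. Separately, a child started with stdout=subprocess.PIPE whose output is never read can block once the pipe buffer fills, so poll() keeps returning None. The following is a minimal sketch of a polling loop that avoids both, written for Python 3. The names `start`, `run_pipeline` and the shape of `jobs` are illustrative assumptions, and the commands are stand-ins launched with `sys.executable` rather than real gdalwarp or gdalinfo calls.

```python
import subprocess
import sys
import tempfile
import time


def start(cmd):
    """Launch cmd with its output sent to a temp file, so the child
    never blocks on a full pipe while we are busy polling."""
    log = tempfile.TemporaryFile()
    proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc, log


def run_pipeline(jobs, poll_interval=0.05):
    """jobs maps a tile name to a non-empty, ordered list of commands
    (hypothetical input shape). Returns {tile: [(exit_code, text), ...]}
    with one entry per stage that actually ran."""
    results = {tile: [] for tile in jobs}
    active = {tile: (0,) + start(cmds[0]) for tile, cmds in jobs.items()}
    while active:
        # Iterate over a snapshot so entries can be replaced or deleted.
        for tile, (stage, proc, log) in list(active.items()):
            code = proc.poll()
            if code is None:
                continue  # still running
            log.seek(0)
            results[tile].append((code, log.read().decode(errors="replace")))
            log.close()
            next_stage = stage + 1
            if code == 0 and next_stage < len(jobs[tile]):
                # Previous stage succeeded: launch the follow-up command.
                active[tile] = (next_stage,) + start(jobs[tile][next_stage])
            else:
                # Finished or failed: always drop the entry so the loop ends.
                del active[tile]
        time.sleep(poll_interval)
    return results


if __name__ == "__main__":
    py = sys.executable
    demo = {
        "tile_ok": [[py, "-c", "print('warp')"], [py, "-c", "print('stats')"]],
        "tile_bad": [[py, "-c", "import sys; sys.exit(3)"], [py, "-c", "print('never')"]],
    }
    for tile, stages in sorted(run_pipeline(demo).items()):
        print(tile, [code for code, _ in stages])
```

In the real script each command list would be the gdalwarp or gdalinfo argument list for that tile. The short sleep keeps the loop from spinning at full speed while the children run.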
Can anyone recommend a better way to loop through the list of tiles, spawn a subprocess for each tile that can process concurrently, and spawn a second subprocess when the first completes for each tile?
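One possible shape for this, offered as a sketch rather than a definitive answer: because the heavy work happens in the external GDAL processes, a small pool of threads may be enough to run several tiles at once, with each worker running the first command and then the second. The example below is written for Python 3 and uses `multiprocessing.pool.ThreadPool` and `Popen.communicate()`, which reads the output and waits in one call. The helper names (`run_step`, `process_tile`, `run_all`, `demo_jobs`) are made up for illustration, and the commands are `sys.executable` stand-ins for gdalwarp and gdalinfo.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_step(cmd):
    """Run one command to completion; return (exit code, captured text)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()  # drains the pipe while waiting for exit
    return proc.returncode, out.decode(errors="replace")


def process_tile(job):
    """Run the first command, then the second only if the first succeeded."""
    tile, first_cmd, second_cmd = job
    results = [run_step(first_cmd)]
    if results[0][0] == 0:
        results.append(run_step(second_cmd))
    return tile, results


def demo_jobs():
    # Hypothetical stand-ins: in the real script these would be the
    # gdalwarp and gdalinfo argument lists for each tile.
    py = sys.executable
    return [
        (name,
         [py, "-c", "print('warp %s')" % name],
         [py, "-c", "print('stats %s')" % name])
        for name in ("A1", "A2", "B1")
    ]


def run_all(jobs, workers=4):
    """Process jobs on a thread pool; each thread only waits on a child process."""
    pool = ThreadPool(workers)
    try:
        return dict(pool.imap_unordered(process_tile, jobs))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    for tile, results in sorted(run_all(demo_jobs()).items()):
        for code, text in results:
            print(tile, code, text.strip())
```

The `workers` argument caps how many external processes run at the same time, which is the main knob for trading speed against memory use.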
UPDATE: Thanks for all the advice so far. I tried to refactor the code above to take advantage of the multiprocessing module and Pool.
Here is the new code:
    def ProcessTile(tile):
        tileName = os.path.join(outputTileDir, "Tile_%s.img" % regex.sub('_', tile))
        warp = subprocess.Popen(['gdalwarp', '-ot', 'Int16', '-r', 'cubic', '-of', 'HFA',
                                 '-cutline', tileIndexShp, '-cl', os.path.splitext(os.path.basename(tileIndexShp))[0],
                                 '-cwhere', "%s = '%s'" % (tileNumField, tile), '-crop_to_cutline',
                                 os.path.join(inputTileDir, 'mosaic_Proj.vrt'), tileName],
                                stdout=subprocess.PIPE)
        # Parentheses keep the "Retile"/"Statistics" prefix in the message when a step fails
        warpMsg = tile, "Retile %s" % ("Successful" if warp.wait() == 0 else "Unsuccessful")
        info = subprocess.Popen(['gdalinfo', '-stats', tileName], stdout=subprocess.PIPE)
        statsMsg = tile, "Statistics %s" % ("Successful" if info.wait() == 0 else "Unsuccessful")
        return warpMsg, statsMsg

    print "Retiling..."
    pool = multiprocessing.Pool()
    for warpMsg, statsMsg in pool.imap_unordered(ProcessTile, tileNums):
        print "%s\n%s" % (warpMsg, statsMsg)
This is causing some major problems for me. First of all, I end up with many new processes being created. About half are python.exe and the other half are gdalbuildvrt.exe, another GDAL utility that I call before the code above to mosaic the incoming imagery when it is already tiled in a different tiling scheme. Between all the python.exe and gdalbuildvrt.exe processes, about 25% of my CPU (an Intel i7 with 8 logical cores when hyperthreaded) and 99% of my 16 GB of RAM are in use, and the computer hangs completely. I can't even kill the processes in Task Manager or from the command line with taskkill.
What am I missing here?
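One likely explanation, stated as an assumption since the full script is not shown: on Windows, multiprocessing starts each worker as a fresh python.exe that imports the main script again. Any code at module level, such as a gdalbuildvrt call or the Pool creation itself, would then run again in every worker, which would match the growing number of python.exe and gdalbuildvrt.exe processes. The sketch below, written for Python 3, shows the usual `if __name__ == "__main__":` guard. The tile ids, the `tile_filename` naming rule and the pool size of 2 are placeholders chosen for illustration.

```python
import multiprocessing
import re


def tile_filename(tile):
    """Build an output file name from a tile id (naming rule assumed for illustration)."""
    return "Tile_%s.img" % re.sub(r"\W", "_", tile)


def process_tile(tile):
    # Placeholder for the gdalwarp + gdalinfo calls; here it only builds the name.
    return tile, tile_filename(tile)


def main():
    # One-off setup, such as the gdalbuildvrt mosaic step, would go here so
    # that it runs once in the parent process and never in the workers.
    tiles = ["A-1", "A-2", "B-1"]  # hypothetical tile ids
    pool = multiprocessing.Pool(processes=2)  # small explicit cap, chosen arbitrarily
    try:
        for tile, name in pool.imap_unordered(process_tile, tiles):
            print("%s -> %s" % (tile, name))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    # Workers import this module under a different __name__, so main()
    # is not executed again when they start up.
    main()
```

Passing an explicit `processes` value also limits how many tiles are in flight at once, which may help if each gdalwarp call needs a lot of memory.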