
I have a list of files that need to be preprocessed using just one command before being mosaicked together. This preprocessing command uses third-party software via a system call to write a geoTIFF. I wanted to use multi-threading so that the individual files can be pre-processed at the same time and then, once all individual files are processed, the results can be mosaicked together.

I have never used multi-threading/parallel processing before, and after hours of searching on the internet, I still have no clue what the best, simplest way to go about this is.

Basically, something like this:

files_list = # list of .tif files that must be individually pre-processed before being mosaicked together

for tif_file in files_list:
    # kick the pre-processing step out to the system, but don't wait
    # for it to finish before moving on to the next tif_file

# wait for all tiffs in files_list to finish pre-processing
# then mosaic them together

How could I achieve this?

user20408
  • What is the output of the pre-processing? – Peter Wood Sep 22 '16 at 15:45
  • Any reason this task should be parallelized? Doing these files one after another would definitely be much faster (except in a few special cases) due to the overhead Python has for multithreading. – Tomasz Plaskota Sep 22 '16 at 15:49
  • @PeterWood the output of the pre-processing step are geoTIFFs that I need to mosaic together – user20408 Sep 22 '16 at 15:51
  • Are geoTIFFs files or in memory? – Peter Wood Sep 22 '16 at 15:52
  • @TomaszPlaskota Well, the purpose was to make the code faster, hah. Can you explain in more detail? How do you know that that is the case? Thx – user20408 Sep 22 '16 at 15:52
  • @PeterWood They are actual files – user20408 Sep 22 '16 at 15:53
  • Well, there are many detailed answers to this already on SO. Try this: http://stackoverflow.com/a/10789458/6313992 . In short, if you are just doing Python operations you are much better off just doing them in the main program. Special cases include: obtaining responses over http/sockets, writing to files, any blocking IO... In-detail explanation: https://42bits.wordpress.com/2010/10/24/python-global-interpreter-lock-gil-explained-pycon-tech-talk/ – Tomasz Plaskota Sep 22 '16 at 15:56
  • @TomaszPlaskota Thank you. I should have clarified, but I'm actually using a system call to do the preprocessing; it's not a command that is native to Python. And yes, the preprocessing step writes files using third-party software that has a multi-threading option. I just didn't know how to wait for that step to be complete for all input files before moving on. – user20408 Sep 22 '16 at 16:19
  • @TomaszPlaskota I edited my question to add more details about this – user20408 Sep 22 '16 at 16:47

2 Answers


See the multiprocessing documentation.

from multiprocessing import Pool

def main():
    # a pool of 8 worker processes; map() blocks until
    # every file in files_list has been pre-processed
    pool = Pool(processes=8)
    pool.map(pre_processing_command, files_list)

    # all pre-processing is finished, so mosaic the results
    mosaic()

if __name__ == '__main__':
    main()
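Since the pre-processing is an external program, `pre_processing_command` could be a thin wrapper around `subprocess` (a sketch; the tool name below is a placeholder, not the asker's actual command):

```python
import subprocess

def pre_process(tif_file, command=("preprocess_tool",)):
    # "preprocess_tool" is a placeholder -- substitute the actual
    # third-party command line.  check_call blocks until the external
    # process exits and raises CalledProcessError on a non-zero status.
    subprocess.check_call([*command, tif_file])
```

With `Pool.map`, each worker simply blocks in `check_call` while the external tool runs, so the parallelism comes from the operating system running several external processes at once.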
Peter Wood

If you need to use multiple processor cores you should use multiprocessing; in the simplest case you can use something like:

from multiprocessing import Process

def process_function(tif_file):
    ...  # your processing code here

processes = []
for tif_file in files_list:
    # note the trailing comma: args must be a tuple
    p = Process(target=process_function, args=(tif_file,))
    p.start()
    processes.append(p)

# join() after all have started, otherwise each file
# would be processed serially
for p in processes:
    p.join()

You need to take care, because too many processes running at the same time can exhaust the PC's resources; you can look here and here for solutions to the problem.
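One simple way to cap the number of simultaneous processes is a multiprocessing.Pool, which keeps at most N workers busy and hands each worker the next file as one finishes (a sketch; `process_function` below is a stand-in for the real pre-processing step):

```python
from multiprocessing import Pool, cpu_count

def process_function(tif_file):
    # stand-in for the real pre-processing step
    return tif_file.upper()

if __name__ == '__main__':
    files_list = ['a.tif', 'b.tif', 'c.tif']
    # at most cpu_count() worker processes run at once;
    # map() returns only after every file has been handled,
    # in the same order as files_list
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(process_function, files_list)
```

This gives the same "wait for everything, then mosaic" behavior as the join loop above, without ever spawning more processes than cores.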

You can also use threading.Thread, but it uses only one processor core and is restricted by the Global Interpreter Lock.

Cesar