
I'm trying to run this multiprocessing pool and I can't figure out why it never completes; it just seems to process endlessly. I am confident the function I am calling works (I have tested it without the pool), so the error seems to be here. Any thoughts? The code runs as far as the for loop based on what prints.

The function it is calling runs rasterstats.zonal_stats if that matters.

if __name__ == "__main__":
    #create and configure the process pool
    print('inside', flush=True)
    with multiprocessing.Pool(processes = multiprocessing.cpu_count()) as pool:
        print('inside2', flush=True)
        #prepare arguments for function as list of variables and bands
        items = [(var, band) for var in clim_rasts.keys() for band in bands]
        print('inside3', flush=True)
        #concat the results
        stime2 = time.time()
        for result in pool.starmap(main_climate_task, items):
            print('result', result, flush=True)
            climate_data = pd.concat([climate_data, result])
        etime2 = time.time()
        dur2 = etime2-stime2
        print(dur2, flush=True)
SturgeonNW
  • Try with one or two items only in `items` and/or with only one process in the pool to gather further information. Also you can use `print`s in `main_climate_task` to narrow down the problem. – Michael Butscher Feb 27 '23 at 11:12
  • To be clear, when you say "The code runs as far as the for loop based on what prints.", you mean you can actually see multiple results from the line `print('result', result, flush=True)` being printed? – 9769953 Feb 27 '23 at 11:23
  • The code might be more straightforward with `results = pool.map(main_climate_task, items)`, then `climate_data = pd.concat(results)`. That will just gather all the resulting dataframes from your task in a list, then concatenate them in one go. The for loop isn't necessary (since it'll be implicit). – 9769953 Feb 27 '23 at 11:25
  • A bit more detail: I have tried one tuple (var,band) which I understand should provide only 1 result, but nothing turns up. The print 'result' never comes as it doesn't get past the pool.starmap (it doesn't even go into the function as I have a print statement at the start of that as well). Regarding dropping the for loop, I tried that as well, but it has the same effect (does not complete the pool.map). I put it in the for loop hoping to see at least some of the results. – SturgeonNW Feb 28 '23 at 02:04

2 Answers


Have you considered imap?

You only need starmap if you have multiple arguments to the function. But you have only a single argument, that happens to be a tuple. So you can use the non-star methods.
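As a toy sketch of how the two calling conventions differ (hypothetical `two_args` and `one_tuple` functions, not the asker's code): `starmap` unpacks each tuple into separate parameters, while plain `map` passes each tuple through whole as a single argument.

```python
import multiprocessing

def two_args(var, band):
    # starmap unpacks each tuple into separate parameters
    return f"{var}:{band}"

def one_tuple(var_band):
    # plain map passes each tuple through as a single argument
    var, band = var_band
    return f"{var}:{band}"

def demo(items):
    with multiprocessing.Pool(processes=2) as pool:
        return pool.starmap(two_args, items), pool.map(one_tuple, items)

if __name__ == "__main__":
    starred, mapped = demo([("a", 1), ("b", 2)])
    print(starred)  # ['a:1', 'b:2']
    print(mapped)   # ['a:1', 'b:2']
```

Both produce the same output here; the choice only depends on whether the worker function takes separate parameters or a single tuple.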

Have you considered imap_unordered?

That allows each result to come out as soon as it is ready, rather than waiting its turn to come out in the original order.

Try changing

    for result in pool.starmap(main_climate_task, items):
        print('result', result, flush=True)
        climate_data = pd.concat([climate_data, result])

To

    for result in pool.imap_unordered(main_climate_task, items):
        print('result', result, flush=True)
        climate_data = pd.concat([climate_data, result])

If this at least starts to output results, that is good news.

In general I try to use imap_unordered whenever possible, because it makes maximal use of all your cores: any core that finishes its work is immediately given the next item, since results can be emitted as soon as they are ready, rather than a core having to wait for earlier-listed items to be completed by other cores.
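
A minimal sketch of that behaviour (a toy `slow_square` worker of my own, not the asker's task), where later-submitted items finish sooner and are yielded as they complete:

```python
import multiprocessing
import time

def slow_square(x):
    # Simulate uneven task durations: larger inputs finish sooner here
    time.sleep(0.05 * (4 - x))
    return x * x

def run_unordered(values):
    # imap_unordered yields results in completion order, not submission order
    with multiprocessing.Pool(processes=4) as pool:
        return list(pool.imap_unordered(slow_square, values))

if __name__ == "__main__":
    out = run_unordered([1, 2, 3, 4])
    print(out)          # completion order; often not [1, 4, 9, 16]
    print(sorted(out))  # [1, 4, 9, 16]
```

The set of results is always the same; only the arrival order varies from run to run.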

If ordering of results matters

As pointed out by @9769953, you can use pool.map:

results = pool.map(main_climate_task, items) 
climate_data = pd.concat(results)

I have a bias towards imap_unordered, because I like seeing the results coming out as early as possible. Moreover, you currently have the puzzle of receiving no results at all, i.e. you cannot tell whether (a) one of your items is not finishing, (b) some of your items are not finishing, or (c) none of your items is finishing. imap_unordered makes it easy to distinguish these possibilities.

For my approach there is a little extra work to do if you need climate_data to be in the order of items. For example, in the items = [(var, band)..., you could add indices with enumerate:

items = [(var, band, i_var, i_band) for i_var, var in enumerate(clim_rasts.keys()) for i_band, band in enumerate(bands)]

Then arrange for main_climate_task to pass i_var and i_band back in its result, so you can re-order the result entries into the desired order afterwards.
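
A runnable sketch of that re-ordering idea, with toy stand-ins (my own `tagged_task`, and short lists in place of clim_rasts.keys() and bands):

```python
import multiprocessing

def tagged_task(args):
    # Hypothetical worker: echoes its indices back so results can be re-ordered
    var, band, i_var, i_band = args
    return (i_var, i_band, f"{var}-{band}")

def run_ordered(variables, bands):
    items = [(var, band, i_var, i_band)
             for i_var, var in enumerate(variables)
             for i_band, band in enumerate(bands)]
    with multiprocessing.Pool(processes=2) as pool:
        # Results may arrive in any order...
        results = list(pool.imap_unordered(tagged_task, items))
    # ...so sort by the indices each task passed back to restore submission order
    results.sort(key=lambda r: (r[0], r[1]))
    return [payload for _, _, payload in results]

if __name__ == "__main__":
    print(run_ordered(["tmin", "tmax"], [1, 2]))
    # ['tmin-1', 'tmin-2', 'tmax-1', 'tmax-2']
```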

ProfDFrancis
  • Why not simply `results = pool.map(main_climate_task, items)`, then `climate_data = pd.concat(results)`, for the section "If ordering of results matters"? – 9769953 Feb 27 '23 at 11:26
  • Thanks for the idea. I have now tried the imap_unordered approach and have the same issue; it just doesn't seem to do anything towards calling the function. I also tried the pool.map option with the same outcome. I can call the function without multiprocessing with just a simple test, i.e. test = main_climate_task('var',1), and it runs. Regarding the tuple comment, perhaps that is something going on? My function is defined like this: def main_climate_task(var, band):. Does that not mean I have two arguments? Do I need to divide the items somehow? – SturgeonNW Feb 28 '23 at 02:11
  • Thanks Eureka, I had incorrectly set up the var and band as two arguments for the function, but was trying to pass a single argument tuple to the function. I have fixed that up now I think by using def main_climate_task(var_band): and then unpacking the tuple inside the function using var, band = var_band. However, this still does not appear to have resolved the issue. As far as I can tell the function still doesn't start as the first print line inside the function doesn't run. – SturgeonNW Feb 28 '23 at 02:24
  • Sorry, I gave you a wrong hint. Starmap automatically converts tuples in the list, into a series of separate parameters when it calls the target function. Moving on to your remaining question, have you tried removing the entire contents of the original target function and making it just a simple "return" of a constant? That will help narrow down the problem. – ProfDFrancis Feb 28 '23 at 02:31
  • @Eureka I tried defining the function 'def main_climate_task(var_band): print(1) var = 1 return var' and then running each of the above versions (starmap, imap_unordered and simply map) and none of them get to the print(1) or return any items. It just shows the asterisk as if the cell is still running. – SturgeonNW Feb 28 '23 at 02:47
  • So there is something seriously wrong with multiprocessing on your system. Can you try a simple multiprocessing example from a tutorial, and see if it works? – ProfDFrancis Feb 28 '23 at 03:42
  • Ok, you have hit something here; it seems it is an environment issue. I am using Windows and JupyterLab. I tried it in Spyder and it worked fine. I have now found this post (https://stackoverflow.com/questions/48846085/python-multiprocessing-within-jupyter-notebook) discussing the issues of multiprocessing with Windows and JupyterLab, but I am struggling to get any of the solutions presented there to work. I would like to continue to use JupyterLab if possible, as all of my other code is there and I find the interface easier than Spyder. – SturgeonNW Mar 01 '23 at 04:01

All due thanks to @Eureka for pointing out the environment may be the issue. I found out here (Python Multiprocessing within Jupyter Notebook) that multiprocessing will not run in JupyterLab on Windows. I saved my code in a .py file and ran that and it works fine.

    %%writefile multiprocess.py

    (ALL THE CODE GOES HERE)

Then, in a separate cell:

    %run multiprocess.py
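
For anyone hitting the same wall: the underlying requirement is that the pool's target function be importable by the spawned child processes, which is why a function defined only inside a notebook cell never starts on Windows. An illustrative sketch of that idea (a toy climate_worker module generated on the fly, not the asker's code):

```python
import multiprocessing
import pathlib
import sys
import tempfile

def run_pool_with_module():
    # Write a toy worker to its own .py file so child processes can import it.
    # (On Windows, children are spawned and must re-import the worker module.)
    tmp = tempfile.mkdtemp()
    pathlib.Path(tmp, "climate_worker.py").write_text(
        "def main_climate_task(var_band):\n"
        "    var, band = var_band\n"
        "    return f'{var}:{band}'\n"
    )
    sys.path.insert(0, tmp)
    import climate_worker
    items = [("tmin", 1), ("tmax", 2)]
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(climate_worker.main_climate_task, items)

if __name__ == "__main__":
    print(run_pool_with_module())  # ['tmin:1', 'tmax:2']
```

In practice you would simply keep the worker in a regular .py file alongside the notebook and import it, rather than generating it at runtime as this sketch does.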
SturgeonNW