
I have been looking to parallelize some tasks in Python but did not find anything useful. Here is the pseudocode for which I want to use parallelization:

# Here I define a list for the results. This list has to contain the results in the SAME order.
result_list = []

# Loop over a list of elements. I need to keep that loop. I mean, in the final code this loop must still be there for specific reasons. Also, the results need to be stored in the SAME order.
for item in some_list:

    # Here I use a method to process the item of the list. The method "task" is the function I want to parallelize
    result = task(item)

    # Here I append the result to the result list. The results must be in the SAME order as the input data
    result_list.append(result)

I want to parallelize the method task, which takes a single item, processes it, and returns some results. I want to collect those results in the same order as in the original list.

The results in the final list result_list have to be in the same order as the items in the input list.

Alex
  • your specifications are unclear, as parts of your post contradict each other... You ask `Or can I just do something like` and provide code that has `results = pool.map(process, alldata)`, but ask `how to get the results?`. Further down, you mention that you specifically want to use some kind of for loop, so the code you asked about in the middle would not be applicable anyway? – FlyingTeller Jun 30 '23 at 06:53
  • Sorry, I am not an expert in using multiprocessing in Python! Sure I can make it more clear... – Alex Jun 30 '23 at 06:54
  • And your result has to be in order of input I assume? – FlyingTeller Jun 30 '23 at 06:54
  • @FlyingTeller Yes, as I have written two times. I will mention it again to be sure – Alex Jun 30 '23 at 06:56
  • Does `pool = ThreadPool()` followed by `result_list = list(pool.map(task, some_list))` work for you? Afterwards, you would just call `pool.close()` – FlyingTeller Jun 30 '23 at 06:59
  • @FlyingTeller This cannot work, as I have no list. I only have a SINGLE ITEM. Please read my question – Alex Jun 30 '23 at 07:01
  • I also get an error: `ImportError: cannot import name 'ThreadPool' from 'multiprocessing'` – Alex Jun 30 '23 at 07:02
  • Do you know the number of items you need to process from the start? – FlyingTeller Jun 30 '23 at 07:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254308/discussion-between-flyingteller-and-alex). – FlyingTeller Jun 30 '23 at 07:03
  • Sure, I will rewrite the whole code and use the list of items... – Alex Jun 30 '23 at 07:04
  • Does this answer your question? [Is there a simple process-based parallel map for python?](https://stackoverflow.com/questions/1704401/is-there-a-simple-process-based-parallel-map-for-python) – mkrieger1 Jun 30 '23 at 07:11
  • The concurrent.futures module gives you both ThreadPoolExecutor and ProcessPoolExecutor. They are interchangeable in the sense that any code using the first of these classes can also be used with the second. What you have to figure out is the best option - threading or multiprocessing. The former is ideal for I/O-bound activity whereas the latter is better for CPU-bound activity. In the case of multiprocessing you also need to consider how you transfer information between the parent and worker processes. If the object(s) to be moved can be pickled then you're OK - otherwise it gets a bit complicated – DarkKnight Jun 30 '23 at 07:32

2 Answers


If you do need a for loop, then you could use ThreadPoolExecutor, which lets you submit your jobs to a pool of threads and collect the results afterwards. Since the futures are collected in the same order the jobs were submitted, the results come back in the same order as the input:

from concurrent.futures import ThreadPoolExecutor

result_list = []
with ThreadPoolExecutor(max_workers=16) as executor:
    # Submit one job per item; the futures list preserves submission order
    futures = []
    for item in some_list:
        future = executor.submit(task, item)
        futures.append(future)
    # Collect in submission order, so result_list matches some_list
    for f in futures:
        result_list.append(f.result())

ThreadPoolExecutor is interchangeable with ProcessPoolExecutor, which can be the better choice for CPU-intensive tasks. If in doubt, try both and benchmark for your specific use case. The complete docs can be found here: https://docs.python.org/3/library/concurrent.futures.html
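For example, a minimal sketch of the same submit/collect pattern with ProcessPoolExecutor (assuming a picklable, module-level task; the `if __name__ == '__main__':` guard is required on platforms that spawn worker processes):

from concurrent.futures import ProcessPoolExecutor

def task(item):
    # stand-in for your real CPU-bound work
    return item * item

if __name__ == '__main__':
    some_list = range(10)
    result_list = []
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(task, item) for item in some_list]
        # iterate in submission order, so results match the input order
        for f in futures:
            result_list.append(f.result())
    print(result_list)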

Note that, depending on the number of items, calling append many times can be inefficient. If you know the number of items in advance, I suggest creating the list at the correct length from the start, as sketched below.
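A minimal sketch of that idea, reusing `task` and `some_list` from the question (index-based fill instead of repeated append):

from concurrent.futures import ThreadPoolExecutor

# Preallocate so each future writes to its own slot
result_list = [None] * len(some_list)
with ThreadPoolExecutor(max_workers=16) as executor:
    futures = [(i, executor.submit(task, item)) for i, item in enumerate(some_list)]
    for i, f in futures:
        result_list[i] = f.result()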

If you arrange the input items in a list, you could also do:

from multiprocessing import Pool

with Pool() as pool:
    # Pool.map blocks until everything is processed and returns the
    # results in the same order as the input list
    result_list = list(pool.map(task, some_list))
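If you do want to keep an explicit for loop, as the question requires, Pool.imap is one option: it yields results lazily but still in input order. A minimal sketch, again assuming `task` and `some_list` from the question:

from multiprocessing import Pool

result_list = []
with Pool() as pool:
    # imap yields results one by one, in the same order as the input,
    # so the original loop structure can stay
    for result in pool.imap(task, some_list):
        result_list.append(result)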
FlyingTeller

You can implement parallel processing easily with either threads or multi-processing.

There are no "rules" to guarantee which mechanism is going to be best for a particular use-case. However, a good starting consideration is to think about what the sub-task (thread or process) is going to be doing. If it's CPU-intensive then multi-processing is probably the best choice. For I/O bound activity, threads may be better.

Here's a simple pattern that shows how multi-threading and multi-processing can be easily interchanged.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

USE_THREADS = False  # set according to whether threads or processes are preferred

EXECUTOR = ThreadPoolExecutor if USE_THREADS else ProcessPoolExecutor

def task(n: int) -> int:
    # stand-in for real work; CPU-bound here, hence processes by default
    return n * n

def main():
    # executor.map preserves input order, just like the built-in map
    with EXECUTOR() as executor:
        print(list(executor.map(task, range(10))))

if __name__ == '__main__':
    main()

Output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
DarkKnight