
I have a nested loop over a huge range of data. At some point it takes hours to calculate the values. I was wondering if I could speed it up with Python's multiprocessing package. Here is my code:

def update_selections(all_selection):
    selections_filtered_all = []
    selections_filtered_all_minus_1 = []
    for n, values in enumerate(all_selection):
        items_set = set()
        sum_length = 0
        for y in values:
            items_set.update(y)
            sum_length += 1
        if len(sum_length) == 300000000:
            selections_filtered_all.append(1)

    selections_filtered_all_minus_1.exted(selections_filtered_all)

Following this answer, this is my attempt; however, it's not working:

def update_selections(all_selection):
    selections_filtered_all = []
    selections_filtered_all_minus_1 = []
    pool = Pool() 
    for n, x in enumerate(all_selection):
        pool.map(process_selections, x)
    
    selections_filtered_all_minus_1.exted(selections_filtered_all)


def process_selections(values):
    items_set = set()
    sum_length = 0
    for y in values:
        items_set.update(y)
        sum_length += 1
    if len(sum_length) == 300000000:
        selections_filtered_all.append(1)

    return items_set, sum_length, selections_filtered_all

all_selection = ['xfRxx', 'asdeEFD', ...]
update_selections(all_selection)

I don't understand how to use Pool() inside a loop. Any suggestion would be appreciated.

J2015
  • what do you think `pool.map(process_selections, x)` does? it creates a process for each value present in `x` and passes such value to the function specified (i.e. `process_selections`). So that function should process *one* value only. Note it may or may not make sense in your case to spawn a process to elaborate one simple value at a time. Maybe break things down in chunks differently from the original source `all_selection` – Pynchia Feb 10 '22 at 15:21
  • I suppose it is another loop and iteration which passes each element of x to process_selections. Maybe I need to replace x with all_selection and get rid of first loop? – J2015 Feb 10 '22 at 15:25
  • in `process_selections` try printing its argument `values`. You will see it is *one* value only – Pynchia Feb 10 '22 at 15:29
  • Also, let each process return a value to `Pool.map` (or more than one value) so that the main process can append it to the `selections_filtered_all` list. You cannot expect each process to do it. There are other ways of course (see [the docs](https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues)), but they're unnecessary – Pynchia Feb 10 '22 at 15:39
  • I am not sure I'm understanding correctly, because even with your changes the whole program blocks – J2015 Feb 10 '22 at 15:54
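
To illustrate the point made in the comments above, here is a minimal sketch (the process_one function and the input strings are purely illustrative): Pool.map does the looping itself, passes one element of the iterable to each task, and collects the return values in the parent process.

from multiprocessing import Pool

def process_one(value):
    # Receives a single element of the iterable, not the whole list.
    return len(set(value))

if __name__ == "__main__":
    with Pool() as pool:
        # map handles the iteration; no outer for loop is needed,
        # and each worker's return value comes back to the parent.
        results = pool.map(process_one, ["xfRxx", "asdeEFD"])
    print(results)  # [3, 7] -- one result per input element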

1 Answer


Try this (but see my notes following):

from multiprocessing import Pool


def process_selection(values):
    items_set = set()
    sum_length = 0
    for y in values:
        items_set.update(y)
        sum_length += 1
    return 1 if sum_length == 300000000 else None
    
def update_selections(all_selection):
    selections_filtered_all_minus_1 = []
    pool = Pool()
    # Just keep the 1 values:
    selections_filtered_all = list(filter(lambda x: x == 1, pool.map(process_selection, all_selection)))
    # You mean extend, right?
    selections_filtered_all_minus_1.extend(selections_filtered_all)
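
For completeness, a minimal usage sketch (the input list is illustrative). The if __name__ == '__main__': guard matters because Pool starts worker processes that re-import the module under the spawn start method (the default on Windows and macOS):

if __name__ == '__main__':
    all_selection = ['xfRxx', 'asdeEFD']  # illustrative data
    update_selections(all_selection)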

Notes

There is something a little odd about the function process_selection (and your original logic). It creates a set that is never used for any purpose. In essence you are just counting the number of elements in values, and you can save a lot of time by not creating the set and adding elements to it. In fact, if values supports len(), you do not even need a loop.
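
For example, under the assumption that values supports len(), the whole function collapses to a single comparison:

def process_selection(values):
    # No set and no loop: count the elements directly.
    return 1 if len(values) == 300000000 else None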

If, however, the intention was to get a count of the number of elements in values after removing duplicates, then you should be taking the length of the set:

def process_selection(values):
    items_set = set(values)
    return 1 if len(items_set) == 300000000 else None

This makes the function far less CPU-intensive than before, and it becomes less clear to what extent multiprocessing will improve performance. The problem is that you appear to be passing very large data items across address spaces, which in itself carries a large overhead.

I am also assuming that the number of elements in the all_selection iterable is not very large, even if each of those elements may be quite large. If this assumption is wrong, you might wish to look into the imap method instead of map, with a suitable chunksize argument. Doing so avoids building one large result list that is only going to be filtered into another list (imap instead returns an iterator that yields each result one by one; it is more memory efficient and perfectly suitable as input to filter).
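
A minimal sketch of that imap variant, reusing process_selection from above; the chunksize value of 100 is purely illustrative and should be tuned to your data:

from multiprocessing import Pool

def update_selections(all_selection):
    with Pool() as pool:
        # imap yields results lazily, so no large intermediate list is built.
        results = pool.imap(process_selection, all_selection, chunksize=100)
        selections_filtered_all = [r for r in results if r == 1]
    return selections_filtered_all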

Booboo