
I need to take a massive list of lists and remove lists that are "unfit".

When using Pool.apply_async, Task Manager shows only around 10% CPU usage but 97% memory usage, and the whole process takes forever.
I am not very knowledgeable on this, but if I am using all my cores, I feel as though it should be using more than 10% of the CPU.
So my questions are as follows:

  1. Is Pool.apply_async the best way to accomplish my goal? I feel like going back to the main process via the callback each time I want to remove an item is adding too much time/overhead.
  2. What is causing the extreme use of memory?

Here is an example of my code, using a smaller list to demonstrate:

from multiprocessing import Pool

w_list = [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
budget = 299
cost = [100, 100, 100]

def cost_interior(w):
    total_cost = 0
    for item in range(0, len(w)):
        if w[item] == 1:
            total_cost = total_cost + cost[item]

    if total_cost > budget or total_cost < (0.5 * budget):
        w_list.remove(w)

def remove_unfit(unfit):
    if unfit is not None:
        w_list.remove(unfit)

if __name__ == "__main__":

    p = Pool(2)
    for w in w_list:
        p.apply_async(cost_interior, args=(w,), callback=remove_unfit)

    p.close()
    p.join()

    print(w_list)
– rml710
  • If you have a "massive list of lists", why are you surprised that it takes up a lot of memory? Probably coordinating work across CPUs is adding more overhead, not reducing processing time at all. – tripleee Jul 27 '20 at 03:52
  • So you want to spin up a separate "parallel" process for each element of the massive list??? Not a very good idea. BTW: processes or threads? – Pynchia Jul 27 '20 at 03:58
  • Have s look at [this QA](https://stackoverflow.com/questions/3033952/threading-pool-similar-to-the-multiprocessing-pool) and [this QA](https://stackoverflow.com/questions/8533318/multiprocessing-pool-when-to-use-apply-apply-async-or-map) – Pynchia Jul 27 '20 at 04:04

2 Answers


You will achieve much better performance by using Pool.map(function, iterable), which splits the iterable (w_list in this case) into chunks and hands each chunk to one of the worker processes to apply the function to.

One more critical optimization is to not call remove() on the list repeatedly, as it's a very expensive operation: each call scans the list for the element and then shifts everything after it. Instead, we can first store the indices that we want to remove and then build a new list.

I have tested the following code and it does seem to run much faster (around 3-4x) than the single-process version (you can swap in the commented-out process_pool = mp.Pool(1) line to see the difference).

import multiprocessing as mp

def cost_interior(w):
    # return True if this row should be removed (over budget or under half the budget)
    budget = 299
    cost = [100, 100, 100]
    total_cost = 0
    for item in range(0, len(w)):
        if w[item] == 1:
            total_cost = total_cost + cost[item]
    if total_cost > budget or total_cost < (0.5 * budget):
        return True
    return False


def main():
    process_pool = mp.Pool(mp.cpu_count())
    # process_pool = mp.Pool(1)  # swap this in to compare against a single worker
    w_list = [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
    w_list = w_list * 1000000  # blow the list up so the timing difference is visible
    should_remove = process_pool.map(cost_interior, w_list)
    process_pool.close()
    process_pool.join()
    # record the indices flagged for removal, then build the filtered list
    should_remove_indices = set()
    for i in range(len(w_list)):
        if should_remove[i]:
            should_remove_indices.add(i)
    w_list_new = []
    for i in range(len(w_list)):
        if i not in should_remove_indices:
            w_list_new.append(w_list[i])

if __name__ == "__main__":
    main()
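
As a side note, the two index loops at the end can be collapsed into a single pass by pairing each row with its flag. This is just a sketch of that variation, with hard-coded example values standing in for the map() output:

# assuming w_list and its per-row flags as produced by process_pool.map above
w_list = [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
should_remove = [False, False, True]  # only [1, 1, 1] costs 300, which exceeds the budget of 299
# keep only the rows that were not flagged for removal
w_list_new = [w for w, remove in zip(w_list, should_remove) if not remove]
print(w_list_new)  # [[1, 0, 1], [1, 1, 0]]
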
– Anmol Singh Jaggi

Unfortunately there probably isn't a good way to do this.

The issue you run into with Python multiprocessing is that it works by creating a pool of additional processes. These processes are copies of the original, so you often end up with NUM_PROCS copies of your data, one for each process. There are some caveats here, but if you see your memory usage go way up, that's likely due to the extra copies of your data.
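
For example, one way to keep the per-task traffic down is to let the workers see the big list as a module-level global and only send indices through the pool. Here is a rough sketch under that assumption (the names are illustrative, and on Windows, where workers are spawned rather than forked, each worker still rebuilds its own copy of the data):

import multiprocessing as mp

# Module-level data: on fork-based platforms the workers inherit this
# copy-on-write; on Windows each spawned worker re-imports the module
# and builds its own copy, so the memory saving does not apply there.
W_LIST = [[1, 0, 1], [1, 1, 0], [1, 1, 1]] * 100000
BUDGET = 299
COST = [100, 100, 100]

def is_unfit(index):
    # only a small integer travels between the processes per item
    total = sum(c for flag, c in zip(W_LIST[index], COST) if flag == 1)
    return total > BUDGET or total < 0.5 * BUDGET

def main():
    with mp.Pool(mp.cpu_count()) as pool:
        flags = pool.map(is_unfit, range(len(W_LIST)))
    kept = [w for w, bad in zip(W_LIST, flags) if not bad]
    print(len(kept))

if __name__ == "__main__":
    main()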

In addition, for Python to communicate between processes, it needs to serialize (pickle) your arguments, pipe them to the worker, and then serialize and send back the response. In your example above, very few clock cycles are needed to do the processing in the worker, so it's probably taking longer to pickle the data and send it than is spent on the actual worker processing. If you don't see processing time go down as you increase the size of the pool, that's likely what's happening.

You can experiment with breaking up the code in different ways to see if you can get something to work but, given the example above, I think it's unlikely you'll get a speedup. There are a few different pool functions you can try (I like pool.imap) but the underlying problem is the same for all of them.
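
To illustrate the kind of thing worth trying, here is a rough sketch of pool.imap with a large chunksize, so that many small rows get pickled and piped together instead of one at a time (the names and sizes are made up for the sketch, and it may still not beat a plain single-process loop for work this small):

import multiprocessing as mp

def is_unfit(w):
    # trivial per-item work, so per-item IPC would dominate without chunking
    cost = [100, 100, 100]
    total = sum(c for flag, c in zip(w, cost) if flag == 1)
    return total > 299 or total < 0.5 * 299

def main():
    w_list = [[1, 0, 1], [1, 1, 0], [1, 1, 1]] * 1000000
    with mp.Pool(mp.cpu_count()) as pool:
        # each pickle/pipe round trip now carries 10000 rows instead of one
        flags = pool.imap(is_unfit, w_list, chunksize=10000)
        kept = [w for w, bad in zip(w_list, flags) if not bad]
    print(len(kept))

if __name__ == "__main__":
    main()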

You can read about the issues with multiprocessing and the Global Interpreter Lock online. I find Python multiprocessing very useful when the sub-tasks take a while, but for very small tasks the overhead is too high.

– bivouac0