How to pass a huge amount of combinations from a nested loop efficiently to a multiprocessing pool without running out of memory?

Question

Right now I'm working on a task where I have create a huge symmetrical numpy array (e.g. a 1788x1788 matrix), calculate the values for the upper triangle (as it's symmetrical, it saves me 50% time) and them mirror it down to the under triangle.

Right now, my code looks like this:

    # Create an array with the correct size
    matrix_M = np.zeros((numb_functions, numb_functions))

    # For each value in the upper triangle, calculate the wanted value and insert it into the matrix
    for k in range(len(matrix_M)):
        for l in range(k, len(matrix_M)):
            matrix_M[k, l] = matrix_m_value_calculation(k, l, list_function_hashes, functions_dict)

    # Mirror the upper triangle to the under triangle to get the complete matrix
    inds = np.triu_indices_from(matrix_M, k=1)
    matrix_M[(inds[1], inds[0])] = matrix_M[inds]

This works fine so far! It takes a while, but it gets the job done.

But since it takes so long, the idea was to create a Pool and call matrix_m_value_calculation with all possible combinations of k,l and use the results to fill the matrix_M. However, this got me into a couple of issues as:

Using apply_async so many times on the same pool uses lots of memory, resulting in memory issues.
Creating a new Pool for each iteration solves the memory issue, but is slower than not using multiprocessing at all (the penalty of spawning/forking a new process is bigger than the run time of the function itself)
map/starmap wants to have an iterable. Here, I'm not sure how to recreate this nested loop in one statement (maybe this is the solution?) and passing all possible combinations of k,l also results requires way too much memory for this to actually work.
Putting all combinations of k,l in a multiprocessing.Queue to pass to another process takes way less memory than the other approach, but it's still unfeasible and I run out of memory eventually.

So, my current idea is that I need to combine this nested loop in one iterable that I can pass to a Pool without running out of memory. What's a good approach to solve this?

Why don't you create like 8 threads each only treating every 8th k? i.e. `for i in range(len(matrix_M))/8: i*8...`, the second one `i*8+1` etc. — Nearoo, Nov 13 '19 at 15:30
I'm only processing a triangle of the matrix, not the whole one. I'm not sure if I'm wrong here, but based on my thoughts this would result in a bunch of threads being done early (the number of rows to process in a column increases for each k) while still having to wait for the others to finish. Passing each value in a process would eliminate this issue and use my CPU to the fullest. At least, this was what I had in mind. — Nirusu, Nov 13 '19 at 15:37
Your CPU most likely only has 8 threads, without advanced parallel programming knowledge I recommend just creating at most 8 threads that have about the same workload. If you're just naively creating more threads you're introducing overhead when the CPU has to switch context to run all your threads at approximately the same speed, _serially_ , because it can't run more than 8 in parallel. — Nearoo, Nov 13 '19 at 15:51
The simplest way to kind of distribute the workflow evenly is to give each thread approximately the same amount of cells the have to calculate. They want be done at _exactly_ the same time, but approximately. — Nearoo, Nov 13 '19 at 15:54
As you see with `inds`, the loops don't need to be nested. You could do `for k,l in zip(indx):` loop. It doesn't reduce the number of iterations, but may make reasoning/experimenting easier. It may also be worth your while to look at the code behind the `np.triu...` functions to see how those indices are generated. — hpaulj, Nov 13 '19 at 17:54
@Nearoo I know, that's what I exactly trying to achieve: a balanced workload between all threads/processes. But since I'm only working on a triangle, this is kinda difficult to achieve simply by creating a new job for the Pool based on indices. — Nirusu, Nov 14 '19 at 07:40
@hpaulj That seems like a step forward, but yet I cannot figure out how to make this work. Do you mean, I should use `indx = np.triu_indices_from` and then the for loop you proposed? Because if I try it like this, I get "not enough values to unpack (expected 2, got 1)" as an error. Otherwise, replacing `indx` with the two range statements does not work because the second statement contains `k`, resulting in "local variable 'k' reference before assignment". Sorry, if I'm having difficulties following here. It's my first time working with iterators in such an extent. — Nirusu, Nov 14 '19 at 07:43
@Nirusu Nah cmon, that's a basic programming problem, you'll figure it out! Think of it like this, the more problems you solve, the faster you become at problem solving, you can only benefit from this. SO is not here to solve your problems, it's here to answer questions, but I can give you some hints: Try to create a function that takes n for thread number n and returns an iterator that this thread should treat. Think of the matrix as one long, one dimensional array. — Nearoo, Nov 14 '19 at 10:19
Also, mirroring your matrix in memory is a huge waste of space. You could build a Matrix wrapper than only stores one half but appears to be whole. Or, even better, you should use a Matrix library such as [Eigen](https://pybind11.readthedocs.io/en/stable/advanced/cast/eigen.html). There is a whole field of research dedicated to the efficient and accurate solution of numerical problems, and the solution are often really complex and non-trivial, and result in sometimes (literally) 1000x more efficient than naive approaches - if you're serious, you should use libs. — Nearoo, Nov 14 '19 at 10:20

How to pass a huge amount of combinations from a nested loop efficiently to a multiprocessing pool without running out of memory?

0 Answers0