
Is this a good parallelization problem? I have two really large arrays (273x1025x2048) and have to do calculations with them to generate three other equally large arrays:

'''
Calculate stuff with 2 different arrays (input), resulting in 3 arrays (output)
'''
import numpy as np

YYY = np.linspace(-90,90,1025)
XXX = np.linspace(0,360,2048)
ZZZ = np.linspace(0,600,273)

globalvar1 = np.random.rand(np.size(ZZZ),np.size(YYY),np.size(XXX))
globalvar2 = np.random.rand(np.size(XXX),np.size(YYY),np.size(ZZZ))

def calc1(y,z):
    GV1 = globalvar1[z,y,:] # length-2048 slice of the (273, 1025, 2048) array
    GV2 = globalvar2[:,y,z] # length-2048 slice of the (2048, 1025, 273) array

    OUT1 = np.exp(-GV1/GV2)
    return OUT1

def calc2(y,z):
    GV1 = globalvar1[z,y,:] # length-2048 slice of the (273, 1025, 2048) array
    GV2 = globalvar2[:,y,z] # length-2048 slice of the (2048, 1025, 273) array

    OUT2 = np.cos(-GV1/GV2)
    return OUT2

def calc3(y,z):
    GV1 = globalvar1[z,y,:] # length-2048 slice of the (273, 1025, 2048) array
    GV2 = globalvar2[:,y,z] # length-2048 slice of the (2048, 1025, 273) array

    OUT3 = np.sin(-GV1/GV2)
    return OUT3

output1, output2, output3 = [], [], []
for j in range(np.size(ZZZ)):            # loop over the z dimension (273)
    output1aux, output2aux, output3aux = [], [], []
    for k in range(np.size(YYY)):        # loop over the y dimension (1025)
        output1aux.append(calc1(k,j))
        output2aux.append(calc2(k,j))
        output3aux.append(calc3(k,j))
    output1.append(output1aux)
    output2.append(output2aux)
    output3.append(output3aux)
    print(ZZZ[j])                        # progress indicator
output1 = np.array(output1)              # final shape (273, 1025, 2048)
output2 = np.array(output2)
output3 = np.array(output3)
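
(For reference, and only under the assumption that the real calculation is purely element-wise like the example calc1/calc2/calc3 above, the three outputs can also be written with NumPy broadcasting and no Python-level loops. This is just a sketch; each full float64 array is roughly 4.6 GB, so it may need to be chunked along the z axis.)

# Vectorized sketch, assuming a purely element-wise calculation
gv2_aligned = globalvar2.transpose(2, 1, 0)  # (2048, 1025, 273) -> (273, 1025, 2048)
ratio = -globalvar1 / gv2_aligned            # one element-wise division over the whole grid
output1 = np.exp(ratio)
output2 = np.cos(ratio)
output3 = np.sin(ratio)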

I tried the solution posted here: Parallelize these nested for loops in python, which involves creating wrappers for the indices and running each subroutine separately. I adapted my code to this (the calc functions now take all three indices and return a single element):

import numpy as np
import multiprocessing 

YYY = np.linspace(-90,90,1025)
XXX = np.linspace(0,360,2048)
ZZZ = np.linspace(0,600,273)

globalvar1 = np.random.rand(np.size(ZZZ),np.size(YYY),np.size(XXX))
globalvar2 = np.random.rand(np.size(XXX),np.size(YYY),np.size(ZZZ))

def index_wrapper1(indices):
    return calc1(*indices)   # return the result so pool.map collects it
def index_wrapper2(indices):
    return calc2(*indices)
def index_wrapper3(indices):
    return calc3(*indices)

def calc1(x,y,z):
    GV1 = globalvar1[z,y,x] # single element of the (273, 1025, 2048) array
    GV2 = globalvar2[x,y,z] # single element of the (2048, 1025, 273) array

    OUT1 = np.exp(-GV1/GV2)
    return OUT1

def calc2(x,y,z):
    GV1 = globalvar1[z,y,x] # single element of the (273, 1025, 2048) array
    GV2 = globalvar2[x,y,z] # single element of the (2048, 1025, 273) array

    OUT2 = np.cos(-GV1/GV2)
    return OUT2

def calc3(x,y,z):
    GV1 = globalvar1[z,y,x] # single element of the (273, 1025, 2048) array
    GV2 = globalvar2[x,y,z] # single element of the (2048, 1025, 273) array

    OUT3 = np.sin(-GV1/GV2)
    return OUT3


def run():
    PROCESSES = 96
    print('Creating pool with %d processes\n' % PROCESSES)

    with multiprocessing.Pool(PROCESSES) as pool:
        XXX = np.arange(np.shape(globalvar2)[0])
        YYY = np.arange(np.shape(globalvar2)[1])
        ZZZ = np.arange(np.shape(globalvar2)[2])
        print('Creating empty array...')
        
        OUT1arr = np.zeros((np.size(XXX), np.size(YYY), np.size(ZZZ)))
        OUT2arr = np.zeros((np.size(XXX), np.size(YYY), np.size(ZZZ)))
        OUT3arr = np.zeros((np.size(XXX), np.size(YYY), np.size(ZZZ)))

        print('Arrays created!')
        print('Loading Pool...')
        
        for i in range(np.size(XXX)):
            print(XXX[i])
            for j in range(np.size(YYY)):
                OUT1arr[i,j,:] = pool.map(index_wrapper1,
                                [(i,j,k) for k in range(np.size(ZZZ))])
                OUT2arr[i,j,:] = pool.map(index_wrapper2,
                                [(i,j,k) for k in range(np.size(ZZZ))])
                OUT3arr[i,j,:] = pool.map(index_wrapper3,
                                [(i,j,k) for k in range(np.size(ZZZ))])
        return [OUT1arr,OUT2arr,OUT3arr]
if __name__ == '__main__':
    OUTPUT = run()
    print(np.shape(OUTPUT))

This works, but not as fast as it should: the serial version is way faster. I thought this would be a good problem for the multiprocessing library. Am I missing something? As in, is this not a good parallelization problem?

I'm using a Jupyter notebook with Python 3 on a Linux cluster. Thanks, y'all!
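
One thing I may be missing (a sketch of my own, not from the linked answer): in the pool version every task is a single (i,j,k) element, so pickling and inter-process traffic can easily dwarf the arithmetic, with hundreds of millions of tiny tasks per output array. A coarser split, e.g. one task per y index with each worker computing a whole (z, x) slab, ships only about a thousand results. calc_slab below is a hypothetical helper; on Linux the forked workers inherit globalvar1/globalvar2, so the inputs are not pickled.

import numpy as np
import multiprocessing

def calc_slab(y):
    # hypothetical helper: compute whole (273, 2048) slabs for one y index
    gv1 = globalvar1[:, y, :]        # shape (273, 2048)
    gv2 = globalvar2[:, y, :].T      # shape (273, 2048) after transposing
    ratio = -gv1 / gv2
    return np.exp(ratio), np.cos(ratio), np.sin(ratio)

if __name__ == '__main__':
    with multiprocessing.Pool(16) as pool:   # far fewer tasks, so fewer workers may be enough
        results = pool.map(calc_slab, range(np.size(YYY)), chunksize=32)
    output1 = np.stack([r[0] for r in results], axis=1)   # (273, 1025, 2048)
    output2 = np.stack([r[1] for r in results], axis=1)
    output3 = np.stack([r[2] for r in results], axis=1)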

  • First, multithreading and multiprocessing are two different things. Threads share the same memory; processes don't. So you'll need to pass data to the processes, and since your initial array is around 500 MB, that may take time. Second, and no trolling intended, "Python" and "fast" are usually contradictory... It's a good tool to pass and sequence data, but computations are usually done in another language, like C, C++, etc., for performance reasons... Maybe try multithreading instead, as a first attempt? – Wisblade Mar 21 '22 at 21:56
  • Thanks for educating me. I understand Python is slow by nature (I'm not sure exactly why), but it's usually my go-to because of the extensive available libraries and the team I work with. I will look into multithreading. If you know of a good place to start, I'd greatly appreciate it. Thank you! – Rafael Mesquita Mar 22 '22 at 12:48
  • Python is slow (up to 100 times slower than C, sometimes...) because it's an interpreted, dynamically typed language with not-so-good memory/CPU management. That's a **very simplified** explanation; there are plenty of pages about the reasons all around the Internet and, more interestingly, about _how to reduce this issue for specific needs_. But out of the box, Python is a snail compared to low-level compiled languages... It can seem "fast" to a human, but at machine level it's slow. [This page](https://tonybaloney.github.io/posts/why-is-python-so-slow.html) can be a starting point. – Wisblade Mar 23 '22 at 01:47
  • Thank you so much for the material... I sort of knew that Python was slow compared to others, but I wasn't aware of how slow that was. I guess I need to reevaluate my strategy here... Anyway, thanks again :) – Rafael Mesquita Mar 24 '22 at 12:53
  • That's not mandatory. Python is heavily used in scientific computing because it's easy to write, modify, etc., so most scientists use it as a sequencer that distributes data to high-performance computation units (including supercomputers). The job itself is done in high-performance, compiled languages, but all the scheduling is done in Python. So you're not entirely wrong; it's a valid starting point. But now you have to look for optimizations for the heavy computations, that's all. – Wisblade Mar 24 '22 at 13:39
  • I'm not sure I fully understand the concept, but that's my bad; I haven't worked with big data long enough to make sense of the requirements. So you're saying that I can still use Python to route the jobs on the cluster, but I should probably invest time in making it more efficient, like passing the data as parameters to those functions? Or use Python to schedule jobs to be performed by something faster like C++? – Rafael Mesquita Mar 24 '22 at 14:52
  • That's exactly the point. Optimize the transfer to the parallel routines, and optimize the routines themselves (in Python for the small ones, in whatever is faster for the big ones), in one order or the other, depending on where your bottleneck is. – Wisblade Mar 24 '22 at 14:55
  • Got it, I will invest some time in that... For what it's worth, I think you responded to all of my questions here. The answer to "Am I missing something?" is that I'm missing a lot of optimizations, and I was also missing that Python is slower by nature. Now the question is "What do I need to do to make this faster?"; that is unanswered, and I will submit an overnight run with a bunch of print(current_time) calls as a diagnostic tool. Then I will know where the bottleneck(s) are and address them... – Rafael Mesquita Mar 24 '22 at 15:06
  • Yep. And don't hesitate to ask another question when you struggle with some detail. Note that I didn't say "IF", but "WHEN"... Be aware that it WILL happen. But that's also the fun part of optimization... – Wisblade Mar 24 '22 at 15:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243305/discussion-between-rafael-mesquita-and-wisblade). – Rafael Mesquita Mar 24 '22 at 21:03
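
Following up on the multithreading suggestion in the first comment above, a minimal sketch (the name exp_slab is illustrative, not from the post): threads share memory, so nothing is pickled, and NumPy's element-wise ufuncs generally release the GIL on large arrays, so a handful of threads can keep several cores busy.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def exp_slab(y):
    # one (273, 2048) slab of the exp output for a given y index
    return np.exp(-globalvar1[:, y, :] / globalvar2[:, y, :].T)

with ThreadPoolExecutor(max_workers=8) as ex:
    slabs = list(ex.map(exp_slab, range(np.size(YYY))))
output1 = np.stack(slabs, axis=1)   # (273, 1025, 2048)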

0 Answers