
I have two arrays, arrayone and arraytwo, which are identical in dimensions, static, and will not be altered. A third array, masterarray, is pre-allocated; at each position the relevant integers are compiled, cast to an intermediate array, and then written into masterarray.

Iterating along the ndarray's columns (j) within each row (i) is reasonably fast, but I'd like to use multiprocessing to accelerate it and share these arrays without excessive memory consumption. Specifically, I'd like to run multiple processes that each loop across every column (j) at a given row (i) and write the result to masterarray in shared memory. I've perused this answer, but the potential instability caused by sharedmem has led me to ask this question. For reference, my code is as follows:

def gridagg():
    # arrayone, arraytwo, and stacked are module-level globals
    masterarray = np.empty([1228, 2606, 208])
    for index, val in np.ndenumerate(arrayone):
        # index is an (i, j) tuple; selection holds the (row, col) pairs
        # to pull from stacked for this cell
        selection = arraytwo[index[0]][index[1]]
        piece = stacked[selection[:, 0], selection[:, 1]].tolist()
        piece = [j for i in piece for j in i]  # flatten the nested lists
        comparray = np.array(piece)
        if index[1] == 0:
            compiled = comparray  # first column starts the row
        else:
            stage1 = comparray
            stage2 = compiled
            if index[1] == 1:
                compiled = np.stack([stage2, stage1])
            else:
                compiled = np.vstack([stage2, stage1[None, :]])
        if index[1] == 2605:  # last column: write the finished row
            masterarray[index[0], :] = compiled

1 Answer


Having multiple workers modify a shared array is usually a bad idea. It's better to have the worker tasks calculate the values and let the main process actually fill in the array.

import multiprocessing as mp
import numpy as np

def initialize_arrays(a, b):
    # Runs once in each worker process; the inputs become read-only
    # globals there, so they aren't re-pickled for every task.
    global arrayone, arraytwo
    arrayone = a
    arraytwo = b

def get_masterarray_row(index):
    contents = ...  # calculate the contents of masterarray[index]
    return index, contents

def main():
    masterarray = np.empty([1228, 2606, 208])
    with mp.Pool(initializer=initialize_arrays, initargs=(arrayone, arraytwo)) as pool:
        # imap_unordered yields rows as workers finish them;
        # the returned index says where each row belongs.
        for index, contents in pool.imap_unordered(get_masterarray_row, range(1228)):
            masterarray[index, :] = contents
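To make the pattern concrete, here is a minimal runnable sketch with toy shapes and a hypothetical per-row computation (a simple broadcast sum standing in for the real work); everything except the pool/initializer structure is invented for illustration:

```python
import multiprocessing as mp
import numpy as np

N_ROWS, N_COLS, DEPTH = 8, 6, 4

def initialize_arrays(a, b):
    # Runs once per worker; stores the read-only inputs as globals
    # so they are not re-pickled for every task.
    global arrayone, arraytwo
    arrayone = a
    arraytwo = b

def get_masterarray_row(index):
    # Stand-in for the real per-row computation: combine the two
    # input rows and broadcast across the depth axis.
    contents = arrayone[index, :, None] + arraytwo[index, :, None] * np.ones(DEPTH)
    return index, contents

def main():
    a = np.arange(N_ROWS * N_COLS, dtype=float).reshape(N_ROWS, N_COLS)
    b = np.ones((N_ROWS, N_COLS))
    masterarray = np.empty((N_ROWS, N_COLS, DEPTH))
    with mp.Pool(initializer=initialize_arrays, initargs=(a, b)) as pool:
        # Rows arrive in whatever order workers finish; the index
        # returned with each result tells us where to store it.
        for index, contents in pool.imap_unordered(get_masterarray_row, range(N_ROWS)):
            masterarray[index, :] = contents
    return masterarray

if __name__ == "__main__":
    result = main()
    print(result.shape)  # (8, 6, 4)
```

Because only the row index is pickled per task (the big arrays travel once, via the initializer), the per-task overhead stays small even for large inputs.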

  • This definitely decreases the runtime; I've clocked a 50% reduction thus far. A follow-up question to your solution: how does mp.Pool determine how many processes to run at once, and can this number be increased/decreased at the mp.Pool with statement? – TornadoEric Apr 18 '23 at 21:50
  • 1
    You're best off reading the documentation to find all the possible options. By default, the number of processes is the number of cpus, but it can be changed with `Pool(processes=10)` or just `Pool(10)`. https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool – Frank Yellin Apr 18 '23 at 22:34
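A quick sketch of that comment's point, using a toy worker function for illustration (the default pool size comes from os.cpu_count(); passing a number overrides it):

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # Pool(processes=4) caps the pool at 4 worker processes;
    # Pool(4) is the equivalent positional form.
    with mp.Pool(processes=4) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```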