
I have a very large matrix (over 100k by 100k) with calculation logic such that each row can be computed independently of the other rows.

I want to use multiprocessing to reduce compute time (with the matrix split into 3 slices of one third of the rows each). However, multiprocessing seems to take longer than a single call that calculates all rows. I am changing different parts of the matrix in each process - is that the issue?

import multiprocessing, os
import time, pandas as pd, numpy as np

def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    return(df+3)  # simplified version of process  
    print('done processing')
          
count=5000

df = pd.DataFrame(np.random.randint(0,10,size=(3*count,3*count)),dtype='int8')
slice1=df.iloc[0:count,]
slice2=df.iloc[count:2*count,]
slice3=df.iloc[2*count:3*count,]

p1=multiprocessing.Process(target=mat_proc,args=(slice1,))
p2=multiprocessing.Process(target=mat_proc,args=(slice2,))
p3=multiprocessing.Process(target=mat_proc,args=(slice3,))

start=time.time()
print('started now')
# this is to compare the multiprocess with a single call to full matrix
#mat_proc(df)

if __name__ == '__main__':   
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()
    
finish=time.time()
print(f'total time taken {round(finish-start,2)}')

2 Answers


When using multiprocessing, move all script-level code into the if __name__ == '__main__' part. Because when each process spawns, it runs your main script again, so each process has to recreate the dataframe, the slicing, etc.

import multiprocessing, os
import time, pandas as pd, numpy as np


def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    return (df + 3)  # simplified version of process
    print('done processing')


if __name__ == '__main__':
    count = 5000

    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    slice1 = df.iloc[0:count, ]
    slice2 = df.iloc[count:2 * count, ]
    slice3 = df.iloc[2 * count:3 * count, ]

    p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
    p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
    p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))

    start = time.time()
    print('started now')
    # this is to compare the multiprocess with a single call to full matrix
    # mat_proc(df)

    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()

    finish = time.time()
    print(f'total time taken {round(finish - start, 2)}')

And consider using multiprocessing.Pool; it can be handy to be able to choose how many processes you want to spawn by changing a single number.
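For instance, a minimal Pool-based sketch of the same computation (the worker count and the np.array_split chunking are my own illustration, not part of the original answer):

import multiprocessing
import numpy as np, pandas as pd


def mat_proc(df):
    return df + 3  # simplified version of the processing


if __name__ == '__main__':
    count = 5000
    n_workers = 3  # change this single number to use more or fewer processes

    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    chunks = np.array_split(df, n_workers)       # split into n_workers row blocks

    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(mat_proc, chunks)     # each block is pickled to a worker

    result = pd.concat(results)                  # reassemble the full matrix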

Second thing: if the computations are cheap (as in the simplified version of the process you provided), spawning the processes and sending data to them (pickling and unpickling the dataframe) will take longer than the computations themselves, and multiprocessing will be slower.
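To illustrate that point, here is a rough sketch I added (not from the answer; exact numbers depend on the machine): with an operation as cheap as df + 3, merely serializing one slice can already be comparable to the computation itself, before any process startup time is counted.

import pickle, time
import numpy as np, pandas as pd

count = 5000
slice1 = pd.DataFrame(np.random.randint(0, 10, size=(count, 3 * count)), dtype='int8')

t0 = time.time()
_ = slice1 + 3               # the actual "work" done on one slice
t1 = time.time()
_ = pickle.dumps(slice1)     # roughly what multiprocessing does just to send the slice
t2 = time.time()

print(f'compute: {t1 - t0:.3f}s  pickle: {t2 - t1:.3f}s')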

dankal444
  • That's not true. The child process execution begins at the target that you pass to the Process constructor. In this case, it's the mat_proc function. Honestly, this makes no difference. – Crash0v3rrid3 Nov 06 '21 at 17:03
  • @Crash0v3rrid3 ok, found out it is a Windows-only thing, and [on Windows it makes a difference](https://stackoverflow.com/a/57559085/4601890). The second part of the answer holds for both Windows and Unix - if OP only has fast operations to perform, multiprocessing will be bottlenecked on pickling the input and output dataframe. – dankal444 Nov 06 '21 at 17:14
  • Only managed data structures use pickling for data exchange. On Unix-based machines, when a new process is spawned using fork, the OS duplicates the memory space (using copy-on-write to improve performance). So this isn't much of a bottleneck because he's not performing writes (see the sketch after these comments). – Crash0v3rrid3 Nov 06 '21 at 17:18
  • Also, my bad on the windows part. – Crash0v3rrid3 Nov 06 '21 at 17:19
  • @Crash0v3rrid3 Thank you for those comments. What do you mean by `managed datastructures`? I am not sure if OP is not performing writes ("I am changing different parts of the matrix in each process") – dankal444 Nov 06 '21 at 17:32
  • IPC Queues (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue), for example. – Crash0v3rrid3 Nov 06 '21 at 17:36
  • @Crash0v3rrid3 thank you, I will share what I found too, [this pickling of arguments is also platform-dependent](https://stackoverflow.com/a/26026069/4601890) – dankal444 Nov 06 '21 at 17:52
  • Yeah, looks like it. I'm not a big Windows fan. Have been using Linux for almost everything, hence the bias. Thanks for the info though! – Crash0v3rrid3 Nov 06 '21 at 18:00
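To make the fork/copy-on-write point from the comments concrete, here is a minimal sketch of my own (Unix only, assuming the 'fork' start method; mat_proc_block is a hypothetical helper): the children read the parent's dataframe directly instead of receiving it as a pickled argument, and only the results travel back through pickling.

import multiprocessing as mp
import numpy as np, pandas as pd

count = 5000
df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')


def mat_proc_block(i):
    # each child reads its row block from the df it inherited via fork (copy-on-write)
    return df.iloc[i * count:(i + 1) * count] + 3


if __name__ == '__main__':
    ctx = mp.get_context('fork')   # 'fork' is not available on Windows
    with ctx.Pool(3) as pool:
        results = pool.map(mat_proc_block, range(3))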

Spawning processes is a costly operation. If the tasks you run in the new processes aren't large enough to make the process spawn time seem negligible, you are better off sticking to one process.

Another option could be to use multithreading, which costs less than multiprocessing. You must decide which one to use based on the scale of your data & total processing time.

This article explains the differences and costs well. Check it out!

Also, using multiprocessing.pool.Pool & multiprocessing.pool.ThreadPool would be cleaner. Check the example below & the official docs to understand their usage.

from multiprocessing.pool import Pool, ThreadPool


def run_parallel(kls):
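    # kls is Pool or ThreadPool; df, count and mat_proc are assumed defined as in the question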
    with kls() as pool:
        return pool.map(mat_proc, [df.iloc[0:count,], df.iloc[count: 2 * count, ], df.iloc[2 * count: 3 * count, ]])


run_parallel(Pool)        # Run with multiprocessing
run_parallel(ThreadPool)  # Run with multithreading
Crash0v3rrid3
  • By using multithreading he won't see any performance gain (in this case); I think the choice is multiprocessing or a single process (maybe with some `numba` to speed things up and use parallelism) – dankal444 Nov 06 '21 at 17:42
  • Why not? Are you referring to the GIL? – Crash0v3rrid3 Nov 06 '21 at 18:01
  • Yes, threads speed things up when doing I/O-bound tasks, which I think is not the case here. – dankal444 Nov 06 '21 at 18:29