
I measured the performance of parallel read_pickle() execution on a Linux machine with 12 cores, running a Python 3.6 interpreter (the code was launched in JupyterLab). I simply open many pickled dataframes:

import pandas as pd

def my_read(filename):
    # path is the directory containing the pickles, defined elsewhere
    df = pd.read_pickle(path + filename)
    print(filename, df.shape)
    return df.iloc[:1, :]

files = ... # list of ~130 file names; each pickle holds a 1,000,000 x 43 dataframe

Since this is an IO-bound operation rather than a CPU-bound one, I would expect the thread-based solution to beat the process-based one.
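As background for that expectation, here is a minimal, self-contained sketch of the same thread-vs-process comparison using the concurrent.futures API and a toy sleep-based task as a stand-in for IO (the pool size and the fake workload are illustrative assumptions, not the measured code):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fake_io(n):
    # stand-in for an IO-bound task: sleep instead of reading a pickle
    time.sleep(0.01)
    return n * 2

if __name__ == "__main__":
    items = list(range(20))
    # threads overlap the sleeps, so 10 workers finish ~10x faster than serial
    with ThreadPoolExecutor(max_workers=10) as tpool:
        threaded = list(tpool.map(fake_io, items))
    # processes do the same work in separate interpreters, paying startup cost
    with ProcessPoolExecutor(max_workers=10) as ppool:
        parallel = list(ppool.map(fake_io, items))
    print(threaded == parallel)  # both pools compute identical results
```

For a task that truly only waits on IO, the thread pool should be at least as fast as the process pool, which is why the measurements below were surprising.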

However, this cell:

%%time
from multiprocessing import Pool
with Pool(10) as pool:
    pool.map(my_read, files) 

gave

CPU times: user 416 ms, sys: 267 ms, total: 683 ms
Wall time: 3min 37s

while this one:

%%time
from multiprocessing.pool import ThreadPool
with ThreadPool(10) as tpool:
    tpool.map(my_read, files)

ran in

CPU times: user 7min 28s, sys: 1min 58s, total: 9min 27s
Wall time: 10min 25s

Why?
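For anyone trying to reproduce this, a small GIL probe can separate deserialization cost from IO. This is a hedged sketch on a synthetic pickle payload, not the original benchmark; in CPython, pickle.loads is CPU-bound and holds the GIL, so threads cannot run it in parallel:

```python
import pickle
import time
from multiprocessing.pool import ThreadPool

# synthetic payload standing in for one pickled dataframe
payload = pickle.dumps(list(range(200_000)))

def load(_):
    # pure deserialization: CPU-bound, runs under the GIL
    return len(pickle.loads(payload))

start = time.perf_counter()
serial = [load(i) for i in range(8)]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPool(4) as tp:
    threaded = tp.map(load, range(8))
threaded_time = time.perf_counter() - start

# if the two times are close, the work is dominated by GIL-bound CPU time
print(f"serial {serial_time:.2f}s, threaded {threaded_time:.2f}s")
```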

  • I find this intriguing, given advice I have read such as https://stackoverflow.com/questions/51828790/what-is-the-difference-between-processpoolexecutor-and-threadpoolexecutor which, as @kdr points out, suggests the opposite should be expected. – JonSG Oct 18 '21 at 13:37
