I am trying to use multiprocessing on a 12-core machine to read an Excel file – a 60 MB file with 15 sheets of 10,000 rows each. Importing all the sheets with pandas.read_excel and no parallelisation takes about 33 seconds.
If I use pool.map() it works, but it takes longer than the non-parallel version: 150 seconds vs 33!
If I use pool.map_async() it takes 36 seconds, but I can't seem to access (and therefore cannot check) the output!
My questions are:
- What am I doing wrong? Both pool.map and pool.map_async take roughly the same time even if I set nrows=10 in the read_single_sheet function – the same time whether each worker reads 10 rows or 10,000. How is that possible?
- How do I get the results of pool.map_async()? I have tried
output = [p.get() for p in dataframes]
but it doesn't work:
MapResult object is not iterable
(my best guess at a fix is sketched just after this list)
- Is this more of an IO-bound than a CPU-bound problem? Still, why does pool.map take so long?
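On the second point, my guess – the docs seem to say pool.map_async() returns a single MapResult object rather than a list – is that I should call get() on the result object itself, something like this (untested; dataframes is the variable from my code below):

output = dataframes.get()  # should block until all workers finish and return the list

But even if that works, it wouldn't explain the timings.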
Reading the same data from CSV (each Excel sheet saved to a separate CSV) takes 2 seconds on my machine. However, CSV is not really a good option for what I need to do: I often have 10 to 20 mid-sized tabs, converting them manually often takes longer than waiting for pandas to read them, and if I receive an updated version I have to do the manual conversion all over again.
I know I could use a VBA script in Excel to save each sheet to CSV automatically, but data types are usually inferred correctly when reading from Excel – not so with CSV, especially for dates (mine are never ISO yyyy-mm-dd): I'd have to identify the date fields, specify the format, and so on; just reading from Excel is often faster. Especially because these tasks tend to be one-offs: I import the data once, maybe two or three times if I receive an update, store it in SQL, and then all my Python scripts read from SQL.
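To illustrate the date problem, this is the kind of per-file setup CSV would force on me (the column name and format here are made up):

import pandas as pd
# hypothetical date column; my real files have several, in dd/mm/yyyy style
df = pd.read_csv('sheet_0.csv', parse_dates=['booking_date'], dayfirst=True)

Reading from Excel, pandas usually gets the dates right straight from the cell types.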
The code I am using to read the file is:
import numpy as np
import pandas as pd
import time
import multiprocessing
from multiprocessing import Pool

def parallel_read():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = list(mydic)  # the dict keys are the sheet names
    dataframes = pool.map(read_single_sheet, sheets)
    return dataframes

def parallel_read_async():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = list(mydic)  # the dict keys are the sheet names
    dataframes = pool.map_async(read_single_sheet, sheets)
    output = None
    # this below doesn't work - can't understand why
    output = [p.get() for p in dataframes]
    return output

def read_single_sheet(sheet):
    out = pd.read_excel('excel_write_example.xlsx', sheet_name=sheet)
    return out

num_cores = multiprocessing.cpu_count()

if __name__ == '__main__':
    start = time.time()
    out_p = parallel_read()
    time_par = time.time() - start
    out_as = parallel_read_async()
    time_as = time.time() - start - time_par
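For reference, the non-parallel baseline (the 33 seconds mentioned above) is essentially the one-liner:

dfs = pd.read_excel('excel_write_example.xlsx', sheet_name=None)  # dict of all 15 sheets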
The code I used to create the Excel is:
import numpy as np
import pandas as pd

sheets = 15
rows = int(10e3)

writer = pd.ExcelWriter('excel_write_example.xlsx')

def create_data(sheets, rows):
    df = {}  # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_excel(df, writer):
    for d in df:
        df[d].to_excel(writer, sheet_name=str(d), index=False)
    writer.close()

df = create_data(sheets, rows)
data_to_excel(df, writer)
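And for the 2-second CSV comparison, I saved each sheet out roughly like this (the file names are just what I happened to use):

for d in df:
    df[d].to_csv('sheet_{}.csv'.format(d), index=False)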