
I am trying to use multiprocessing on a 12-core machine to read an Excel file – a 60 MB file with 15 sheets of 10,000 rows each. Importing all the sheets with pandas.read_excel and no parallelisation still takes about 33 seconds.

If I use pool.map() it works, but it takes longer than the non-parallel version: 150 seconds vs 33!

If I use pool.map_async() it takes 36 seconds, but I can't seem to access (and therefore cannot check) the output!

My questions are:

  • What am I doing wrong? Both pool.map and pool.map_async take roughly the same time even if I set nrows=10 in the read_single_sheet function: the same time whether they read 10 rows or 10,000. How is that possible?
  • How do I get the results of pool.map_async()? I have tried output = [p.get() for p in dataframes] but it doesn't work:

MapResult object is not iterable

  • Is this more of an I/O-bound than a CPU-bound problem? Still, why does pool.map take so long?

Reading the same data from CSV (each Excel sheet saved to a separate CSV) takes 2 seconds on my machine. However, CSV is not really a good option for what I need to do. I often have 10 to 20 mid-sized tabs; converting them manually may often take longer than waiting for pandas to read them, plus if I receive updated versions I have to do the manual conversion again.

I know I could use a VBA script in Excel to automatically save each sheet to CSV, but data types are most often inferred correctly when reading from Excel – not so with CSV, especially for dates (my dates are never ISO yyyy-mm-dd): I'd have to identify the date fields, specify the format, etc.; just reading from Excel would often be faster, especially because these tasks tend to be one-offs: I import the data once, maybe two or three times if I receive an update, store it in SQL, and then all my Python scripts read from SQL.

The code I am using to read the file is:

import numpy as np
import pandas as pd
import time
import multiprocessing
from multiprocessing import Pool

def parallel_read():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = []
    for d in mydic:
        sheets.append(d)
    dataframes = pool.map(read_single_sheet, sheets)
    return dataframes

def parallel_read_async():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = []
    for d in mydic:
        sheets.append(d)
    dataframes = pool.map_async(read_single_sheet, sheets)
    output = None
    # this below doesn't work - can't understand why
    output = [p.get() for p in dataframes]
    return output

def read_single_sheet(sheet):
    out = pd.read_excel('excel_write_example.xlsx', sheet_name=sheet)
    return out

num_cores = multiprocessing.cpu_count()

if __name__ == '__main__':
    start = time.time()
    out_p = parallel_read()
    time_par = time.time() - start

    out_as = parallel_read_async()
    time_as = time.time() - start - time_par

The code I used to create the Excel is:

import numpy as np
import pandas as pd

sheets = 15
rows = int(10e3)

writer = pd.ExcelWriter('excel_write_example.xlsx')

def create_data(sheets, rows):
    df = {}  # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_excel(df, writer):
    for d in df:
        df[d].to_excel(writer, sheet_name=str(d), index=False)
    writer.close()

df = create_data(sheets, rows)
data_to_excel(df, writer)
Pythonista anonymous
  • Just wondering, did you have a look at dask? – Erfan Apr 03 '19 at 22:34
  • I forgot to mention I had looked at dask: it has a read_csv module, but not a read_excel. And, like I said, converting to CSV is rarely a better solution for me – Pythonista anonymous Apr 03 '19 at 22:36
  • I don't think your approach will work. While one process is trying to read from the excel file, won't the other processes have to wait until the synchronization lock is closed? That is why it is slower. You should think about your problem a little differently. Stream the records from your excel in chunks to the process pool from your parent process instead of having the pool workers all try to access the file simultaneously. – skullgoblet1089 Apr 03 '19 at 22:37
  • I am not sure I am following: it's slower because, every time a process reads from the file, it locks it? So two processes cannot read from the same file at the same time? Are you suggesting different processes should access the same sheet? Why would that work and not be subject to the lock? – Pythonista anonymous Apr 03 '19 at 22:40
  • What if I made copies of the file? Could process 1 access sheet 1 from file 1, while process 2 reads sheet 2 from file 2, etc? Would that work, or would it still be subject to the synchronization lock? – Pythonista anonymous Apr 03 '19 at 22:41
  • Yeah, you are starting each process in parallel at the same time; your empirical results suggest to me that only one can access at a time via the pd.read_excel method. If you logged the process id in your target function you'd probably see that the processes are all taking turns to read data from excel and calculate their response. To fix this, I would read all of the relevant data into memory in the main process, then submit that data to the process pool as an argument for the workers to consume. – skullgoblet1089 Apr 03 '19 at 22:44
  • For the record, IO has almost nothing to do with this issue, and locking has nothing at all to do with it: the problem is that `xlrd` processes every row of every sheet every time you call `read_excel`. – Nathan Vērzemnieks Apr 04 '19 at 21:37
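
A quick way to check that last claim on the generated file is to time the same sheet with and without nrows; if xlrd really parses everything regardless, both calls should take roughly as long (a minimal sketch, timings will vary by machine):

import time

import pandas as pd

# if the whole sheet is parsed before the row limit is applied, nrows=1
# should barely change the runtime compared with reading the full sheet
for kwargs in ({'nrows': 1}, {}):
    start = time.time()
    pd.read_excel('excel_write_example.xlsx', sheet_name='0', **kwargs)
    print(kwargs, round(time.time() - start, 2), 'seconds')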

3 Answers


I am posting this as an answer because, while it doesn’t answer the question of how to do it in Python, it still provides a feasible alternative to speed up the reading time materially, so it can be of interest to any Python user; additionally, it relies only on open-source software, and requires the user to learn only a couple of commands in R.

My solution is: do it in R!

I posted about it here, which also shows my (very minimal) code; basically, on the same file, R's readxl took 5.6 seconds. To recap:

  • Python from xlsx: 33 seconds
  • Python from CSV: ca. 2 seconds
  • R from xlsx: 5.6 seconds

The link also has an answer which shows parallelising can speed up the process even more.

I believe the key difference is that pandas.read_csv relies on C code, while pandas.read_excel relies on more Python code. R's readxl is probably based on C. It might be possible to use a C parser to import xlsx files into Python, but AFAIK no such parser is available as of now.

It is a feasible solution because, after importing into R, you can easily export to a format which retains all the information on data types, and which Python can read from (SQL, parquet, etc). Not everyone will have a SQL server available, but formats like parquet or sqlite don't require any additional software.
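
For example, the Python side of such a workflow is a one-liner; this sketch assumes, hypothetically, that the R session has already written each sheet out with arrow::write_parquet(), and it needs pyarrow or fastparquet installed:

import pandas as pd

# hypothetical file name written by R; column dtypes, including dates,
# survive the parquet round trip
df = pd.read_parquet('sheet_0.parquet')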

So the changes to the workflow are minimal: the initial data loading, which, at least in my case, tends to be a one-off, is in R, and everything else continues to be in Python.

I also noticed that exporting the same sheets to SQL is much faster with R and DBI::dbWriteTable than with pandas (4.25 sec vs 18.4 sec).

Pythonista anonymous

A couple of things are going on here:

  • The 36 seconds that parallel_read_async seems to be taking is in fact entirely taken up by the call to pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None). map_async returns immediately, giving you the MapResult object, and you're immediately causing an exception by trying to iterate over it, so in this version essentially no work is being done by the read_single_sheet function.
  • Furthermore, pd.read_excel with sheet_name=None is taking exactly as long as with sheet_name='1' etc. - so in your parallel_read function, each process is doing the work of parsing every row of every sheet. This is why it takes so much longer.

And now that I've written that out, I remember that my company ran into this same problem, and we ended up implementing our own xlsx parser because of it. There's simply no way with xlrd - which pandas uses - to open an xlsx file without parsing it completely.

If you have the option to produce (or receive?) xls files instead, those should be much quicker to work with. Besides that, the export-to-csv option may be your best bet, if the speed of the non-parallel processing is unacceptable.
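
As a side note on the CSV option: the conversion itself does not need VBA or manual work. One slow pass through the workbook with pandas can dump every sheet to CSV, and later runs then use the much faster pd.read_csv path. A minimal sketch, using the file name from the question:

import pandas as pd

# one slow pass: sheet_name=None returns a dict of {sheet name: DataFrame}
all_sheets = pd.read_excel('excel_write_example.xlsx', sheet_name=None)
for name, df in all_sheets.items():
    df.to_csv(name + '.csv', index=False)  # later runs read these with pd.read_csv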

Nathan Vērzemnieks
  • How do I get the results of the async function into a list? I am asking because I want to understand how the function works, even though it is not very well-suited for this specific task – Pythonista anonymous Apr 04 '19 at 11:19
  • The docs for the object `map_async` returns are here: https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.pool.AsyncResult. In short, call `get` with an optional timeout to get a list of results, or give `map_async` a callback argument that will be called with each result as it becomes available. – Nathan Vērzemnieks Apr 04 '19 at 21:35
  • Thank you, but I am afraid I still do not fully understand. Could you please show me how I should modify the code so as to store the output of map_async? – Pythonista anonymous Apr 04 '19 at 21:58
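
For reference, the modification asked about in the last comment is small: map_async returns a single AsyncResult (the MapResult in the error message), so instead of iterating over it you call .get() on it once, which blocks until all workers are done and returns the list of DataFrames. A minimal sketch, assuming the rest of the question's code (read_single_sheet, num_cores, imports) is unchanged:

def parallel_read_async():
    pool = Pool(num_cores)
    # reads 1 row only, to retrieve column names and sheet names
    mydic = pd.read_excel('excel_write_example.xlsx', nrows=1, sheet_name=None)
    sheets = list(mydic)
    result = pool.map_async(read_single_sheet, sheets)
    pool.close()
    output = result.get()  # optionally result.get(timeout=...) to avoid blocking forever
    pool.join()
    return output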

Here is an outline of how you can bypass the file lock and achieve concurrency with few changes to your code:

import io
import xlrd
from functools import partial

def read_sheet(buff, sheetname):
    # reads 1 row only, to retrieve column names and sheet names
    df = pd.read_excel(buff, sheetname=sheetname)
    return df

if __name__=='__main__':
    start=time.time()
    time_par = time.time() -start
    xl = xlrd.open_workbook("myfile.xls")  # you fill in this
    sheets = xl.book.sheet_names()
    buff = io.BytesIO()
    xl.dump(buff)
    buff.seek(0)
    target = partial(read_sheet, buff)
    with Pool(num_processes) as p:
        dfs = p.map(target, sheetnames)
    time_as = time.time() - start - time_par

skullgoblet1089
  • This code has numerous errors - I corrected at least five of them trying to get it to run. It still fails, because `xl.dump` doesn't do what you think it does - it dumps debug data, none of the contents of the file. Even if this did somehow make the file available to the subprocesses, it wouldn't help - it would still be no faster than the original question's code, because of the problem with xlrd I described in my answer. Please make sure you do at least minimal testing of your solutions before posting them. – Nathan Vērzemnieks Apr 04 '19 at 03:47
  • Thanks Nathan, it was an outline meant to illustrate an idea. I did not say it was a working solution. – skullgoblet1089 Apr 04 '19 at 10:57
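
For completeness, here is roughly what the pass-the-data-to-the-workers idea looks like without relying on xlrd internals: the parent reads the raw bytes of the file once and hands them to each worker, which parses its own sheet from an in-memory copy. As the comment above explains, this still does not avoid the cost of parsing the whole workbook inside each call to pd.read_excel, so treat it as a sketch of the idea rather than a fix:

import io
from functools import partial
from multiprocessing import Pool

import pandas as pd

FILENAME = 'excel_write_example.xlsx'

def read_sheet(raw_bytes, sheet):
    # each worker parses its sheet from an in-memory copy of the workbook;
    # pd.read_excel accepts any file-like object
    return pd.read_excel(io.BytesIO(raw_bytes), sheet_name=sheet)

if __name__ == '__main__':
    with open(FILENAME, 'rb') as f:
        raw = f.read()  # hit the disk exactly once
    sheet_names = list(pd.read_excel(io.BytesIO(raw), nrows=1, sheet_name=None))
    with Pool() as pool:
        dfs = pool.map(partial(read_sheet, raw), sheet_names)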