
I have a question about the performance of openpyxl when reading files.

I am trying to read the same xlsx file from several processes using ProcessPoolExecutor; a single file may have 500,000 to 800,000 rows.

In read-only mode, calling sheet.iter_rows() without ProcessPoolExecutor and reading the entire worksheet, it takes about 1s to process 10,000 rows of data. But when I set the min_row and max_row parameters and use ProcessPoolExecutor, the behaviour is different.

totalRows: 200,000
1 ~ 10000 take 1.03s
10001 ~ 20000 take 1.73s
20001 ~ 30000 take 2.41s
30001 ~ 40000 take 3.27s
40001 ~ 50000 take 4.06s
50001 ~ 60000 take 4.85s
60001 ~ 70000 take 5.93s
70001 ~ 80000 take 6.64s
80001 ~ 90000 take 7.72s
90001 ~ 100000 take 8.18s
100001 ~ 110000 take 9.42s
110001 ~ 120000 take 10.04s
120001 ~ 130000 take 10.61s
130001 ~ 140000 take 11.17s
140001 ~ 150000 take 11.52s
150001 ~ 160000 take 12.48s
160001 ~ 170000 take 12.52s
170001 ~ 180000 take 13.01s
180001 ~ 190000 take 13.25s
190001 ~ 200000 take 13.46s
total: take 33.54s

Obviously, looking at each process on its own, the time spent per chunk is indeed smaller. But the overall time has increased, and the further back the row range, the more time each process takes. Reading all 200,000 rows with a single process only takes about 20s.

I'm not very familiar with iterators and haven't looked closely at the source code of openpyxl. Judging from the timings, it looks as if, even when a range is set, the iterator still has to start processing from row 1; I don't know whether this is the case.

I'm not a professional programmer, so if you happen to have relevant experience, please keep your answer as simple as possible.

Here is the code:

import openpyxl
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor

def read(file, minRow, maxRow):
    start = perf_counter()
    book = openpyxl.load_workbook(filename=file, read_only=True, keep_vba=False, data_only=True, keep_links=False)
    sheet = book.worksheets[0]
    val = [[cell.value for cell in row] for row in sheet.iter_rows(min_row=minRow, max_row=maxRow)]
    book.close()
    end = perf_counter()
    print(f'{minRow} ~ {maxRow}', 'take {0:.2f}s'.format(end-start))
    return val


def parallel(file: str, rowRanges: list[tuple]):
    futures = []
    with ProcessPoolExecutor(max_workers=6) as pool:
        for minRow, maxRow in rowRanges:
            futures.append(pool.submit(read, file, minRow, maxRow))
    return futures

if __name__ == '__main__':
    file = '200000.xlsx'
    start = perf_counter()
    tasks = getRowRanges(file)  # helper not shown in the post (see the sketch below)
    parallel(file, tasks)
    end = perf_counter()
    print('total: take {0:.2f}s'.format(end-start))
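
getRowRanges() is not shown in the post; based on the chunks in the timing output it presumably produces 10,000-row (min_row, max_row) pairs, roughly like this sketch (the total row count here is an assumption):

def getRowRanges(file, totalRows=200_000, chunk=10_000):
    # guessed helper, reproducing the ranges seen in the output above:
    # [(1, 10000), (10001, 20000), ..., (190001, 200000)]
    return [(start, min(start + chunk - 1, totalRows))
            for start in range(1, totalRows + 1, chunk)]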
  • You'll only get better performance through parallelisation if you're trying to read multiple sheets at once. This is covered in the openpyxl documentation: https://openpyxl.readthedocs.io/en/stable/performance.html. Just trying stuff out and expecting magic is rarely a good idea. – Charlie Clark Mar 18 '22 at 09:53

2 Answers


Q :
"... a question about the performance ..."
"... please try to be as simple as possible ..."

A :
Having 6 Ferrari sport racing cars ( ~ max_workers = 6 )
is no guarantee of moving 6 drivers ( ~ The Workload )
from the start to the finish
in 1 / 6 of the time.

That does not work,
even if we have a 6-lane-wide racing track ( which we have not ): as you have already reported, there is a bottleneck ( a 1-lane-wide-only bridge on the way from the start to the finish of the race ).

Actually,
there are several more performance-devastating bottlenecks ( The Bridge as the main performance blocker, plus a few smaller, less blocking, nevertheless still degrading bridges ), some avoidable, some not :

  • the file I/O has been no faster than ~ 10k [rows/s] in a pure solo serial run,
    so never expect the same speed to appear "across" the same ( single, single-lane ) bridge ( the shared file-I/O hardware interface ) for any next, concurrently running Ferrari competing for the very resource already used by the first process to read the file ( real-hardware latencies matter, a lot ... the Devil is in the details )

  • another, avoidable, degradation comes from the expensive add-on cost paid for each and every list.append(). Here, try to choose a different container: avoid list-based storage altogether, pre-allocate a block of storage ( a one-time RAM-allocation cost, taking advantage of knowing the size of the result in advance ) and keep storing data on the fly, ideally in cache-line-respectful blocks rather than incrementally ( this might be too technical, yet if performance is to be maxed out, these details matter; a sketch follows right after this list )

  • the dual-iterator one-liner is nice for a workbook example, yet if performance is the focus, try to find another way, perhaps an even simpler XLSX reader ( with less machinery under the hood ), which can export the row-wise consumed cells into plain text that can be collected way, way faster than the as-is code did in its nested-iterator, "syntax-sugared" one-liner ( the sketch after this list shows one such shortcut, values_only=True )
    [ [ ... for cell in row ] for row in sheet.iter_rows(...) ]

  • next come the process-instantiation costs, which enter the revised, overhead-strict Amdahl's Law, reformulated so that it also takes into account the overheads and the atomicity of ( blocks of ) work. Interactive speedup-simulator calculators can be used to test the principal ceiling that any such parallelisation effort will never be able to overcome ( a small worked example follows after this list ).

  • Last, but by no means least - The MEMORY : take your .xlsx file size, multiply it by ~ 50x, and then by 6 workers ~ that is the amount of physical memory expected to be used ( see the docs: "Memory use is fairly high in comparison with other libraries and applications and is approximately 50 times the original file size, e.g. 2.5 GB for a 50 MB Excel file", credit to @Charlie Clark ). For example, a 50 MB file read by 6 workers needs on the order of 15 GB of RAM. If your system does not have that much physical RAM, the O/S starts to suffocate while trying to allocate it and falls into RAM-swap "thrashing" mode ( moving blocks of RAM to the disk-swap area and back, again and again, as it interleaves the 6 workers going forward in a virtual-memory-managed address space simulated inside a too-small physical RAM, at disk-I/O latencies more than 5(!) orders of magnitude longer ), all while trying to cross the already blocking performance bottleneck, yeah - The Bridge ... where the traffic jam is already at its maximum, as the 6 workers try to do the very same thing : move some more data across the even more blocked bottleneck, at an awful latency cost. Reducing how much is read and kept in memory may, yet need not, save us, and at least prevents further inefficiencies.
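
To make the storage and iterator bullets concrete, here is a minimal sketch ( not part of the original answer ): it asks openpyxl for plain values with values_only=True, so no Cell objects get built, and fills a pre-allocated result list by index instead of growing it with append(). The file name and row count are assumptions taken from the question.

from time import perf_counter

from openpyxl import load_workbook


def read_values(path, expected_rows):
    # read one sheet into a pre-allocated list of row tuples
    wb = load_workbook(path, read_only=True, data_only=True, keep_links=False)
    ws = wb.worksheets[0]

    rows = [None] * expected_rows                     # one-time allocation, size known up front
    i = 0
    for row in ws.iter_rows(values_only=True):        # tuples of plain values, no Cell objects
        if i < expected_rows:
            rows[i] = row
        else:
            rows.append(row)                          # fallback if the estimate was too small
        i += 1

    wb.close()
    return rows[:i]


if __name__ == '__main__':
    t0 = perf_counter()
    data = read_values('200000.xlsx', 200_000)        # assumed file from the question
    print(f'{len(data)} rows in {perf_counter() - t0:.2f}s')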
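
And a small worked example of the overhead-aware Amdahl's Law point ( again, not from the original answer; the 80 % serial fraction and the 20 s baseline are assumed numbers, roughly matching the single-process run reported in the question ):

def speedup(serial_fraction, n_workers, overhead_s=0.0, total_serial_time_s=20.0):
    # achievable speedup when part of the work cannot be parallelised and the
    # run adds a fixed setup overhead ( process spawn, workbook re-open, ... )
    serial_part = serial_fraction * total_serial_time_s
    parallel_part = (1.0 - serial_fraction) * total_serial_time_s / n_workers
    return total_serial_time_s / (serial_part + parallel_part + overhead_s)


if __name__ == '__main__':
    print(speedup(0.8, 6))                  # ~1.2x at best, never 6x
    print(speedup(0.8, 6, overhead_s=10))   # with 10 s of extra overheads: ~0.75x, i.e. a slowdown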

user3666197

I believe I have the same problem as the OP.

The puzzling part is that once min_row and max_row are set on sheet.iter_rows(), the execution no longer runs concurrently, as if some sort of global lock were in effect.

The following code tries to dump data from a single large sheet in an Excel file. The idea is to use min_row and max_row on sheet.iter_rows() to pin down a reading window, and a ThreadPoolExecutor for concurrent execution.

from concurrent import futures
from pathlib import Path

from openpyxl import load_workbook

# assumed setup, not shown in the original post: source file and output directory
_file = Path('large.xlsx')
_dst = Path('out')


# artificially create a set of row index ranges,
# 10,000 rows per set till 1,000,000th row
# something like [(1, 10_000), (10_001, 20_000), .....]
def _ranges():
    _i = 1
    _n = 10_000
    while _i <= 1_000_000:
        yield _i, _i + _n - 1
        _i += _n


def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True
                       , data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]

    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:

        rows = sheet.iter_rows(values_only=True, min_row=mn, max_row=mx)

        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1

    wb.close()


def main():
    fut = []
    with futures.ThreadPoolExecutor() as ex:
        for mn, mx in _ranges():
            fut.append(ex.submit(write_to_file, _file, mn, mx))

    futures.wait(fut)


if __name__ == '__main__':
    main()

All the write_to_file() calls do kick off at once.

Iteration over the rows, however, seems to proceed in strictly sequential fashion.

With a little change:

def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True
                       , data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]

    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:

        rows = sheet.iter_rows(values_only=True)
                              # ^^^^^^^^^^^^^^^^^___min_row/max_row not set

        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1

Section 20001-30000 writes first!

The chaotic interleaving of concurrent execution shows up.

But without min_row and max_row there is no point in concurrent execution at all, since every worker just reads the whole sheet.
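
A quick way to check the suspicion from the question ( that the iterator still scans from row 1 even when min_row is set ) is to time a single, non-parallel read of a late window. A minimal sketch, assuming the 200000.xlsx file from the question: if rows 190,001-200,000 alone take roughly 13 s instead of ~1 s, the read-only iterator really is streaming through all the preceding rows before it reaches the window.

from time import perf_counter

from openpyxl import load_workbook


def time_window(path, min_row, max_row):
    start = perf_counter()
    wb = load_workbook(path, read_only=True, data_only=True, keep_links=False)
    ws = wb.worksheets[0]
    n = sum(1 for _ in ws.iter_rows(min_row=min_row, max_row=max_row, values_only=True))
    wb.close()
    print(f'rows {min_row}-{max_row}: {n} rows in {perf_counter() - start:.2f}s')


if __name__ == '__main__':
    time_window('200000.xlsx', 1, 10_000)          # expected: about 1 s
    time_window('200000.xlsx', 190_001, 200_000)   # much slower if rows 1..190000 are streamed through first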

Bing-hsu Gao