
I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd, openpyxl, and pyexcel-xlsx, but each of them takes more than 35 minutes because it loads the whole file into memory.

I unzipped the Excel file and found that the XML which contains the data I need is 800 MB uncompressed.

When you load the same file in Excel, it takes 30 seconds. I'm wondering why it takes so much longer in Python.

Amine
  • This is covered by the openpyxl documentation. – Charlie Clark Jul 05 '16 at 17:05
  • http://stackoverflow.com/questions/38208389/how-to-stream-in-and-manipulate-a-large-data-file-in-python – dot.Py Jul 05 '16 at 17:15
  • Possible duplicate of [Is openpyxl the fastest package to use to read big xlsx files?](http://stackoverflow.com/questions/23796587/is-openpyxl-the-fastest-package-to-use-to-read-big-xlsx-files) – Charlie Clark Jul 06 '16 at 06:30

3 Answers


Use openpyxl's read-only mode to do this.

You'll be able to work with the relevant worksheet instantly.
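
For example, a minimal sketch using read-only mode (the filename and sheet name are placeholders taken from the question; adjust them for your file):

from openpyxl import load_workbook

# read_only=True streams the worksheet XML instead of building the whole workbook in memory
wb = load_workbook('xlfile.xlsx', read_only=True)
ws = wb['Sheet Name']

# only the rows you iterate are parsed, so stop after the first 10
for row in ws.iter_rows(min_row=1, max_row=10):
    print([cell.value for cell in row])

# read-only workbooks keep the file handle open, so close it explicitly
wb.close()

Because only the requested rows are parsed (apart from the shared-strings table, which is loaded once), this should take seconds rather than minutes on a file of that size.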

Charlie Clark

Here it is, I found a solution: the fastest way to read an xlsx sheet.

A 56 MB file with over 500k rows and 4 sheets took 6 seconds to process.

import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")

# xl/workbook.xml lists the sheets; map each sheet name to its worksheet XML path
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# stream the chosen sheet's XML straight out of the zip archive
# note: each "row" here is a raw line of XML, not parsed cell values
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # do whatever you want with your data

file.close()

Enjoy and happy coding.

Amine
  • This does not "read" the file at all; it just iterates over the XML. When reading Excel files, most of the time is spent converting the XML into the relevant Python structures. – Charlie Clark Jul 06 '16 at 06:27

The load time you're experiencing is directly related to the I/O speed of your memory chip.

When pandas loads an Excel file, it makes several copies of the data in memory, since the file structure isn't serialized (xlsx is a zipped XML format that has to be parsed).

In terms of a solution, I'd suggest the following workarounds:

  • load your Excel file through a virtual machine with specialized hardware (here's what AWS has to offer)
  • save your file to a CSV format for local use
  • for even better performance, use an optimized columnar format such as Parquet (see the sketch after this list)
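
For instance, a minimal, hypothetical sketch of that conversion with pandas (the filename and sheet name are placeholders from the question; to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

# pay the slow xlsx parse once
df = pd.read_excel('xlfile.xlsx', sheet_name='Sheet Name')

# save to formats that are much cheaper to reload
df.to_csv('xlfile.csv', index=False)
df.to_parquet('xlfile.parquet')

# subsequent reads skip the Excel parsing entirely
df = pd.read_parquet('xlfile.parquet')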

For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python

Yaakov Bressler