
I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd, openpyxl, and pyexcel-xlsx, but each of them takes more than 35 minutes because it loads the whole file into memory.

I unzipped the Excel file and found that the XML which contains the data I need is 800 MB uncompressed.

When you load the same file in Excel, it takes 30 seconds. I'm wondering why it takes so much longer in Python.

Amine
  • This is covered by the openpyxl documentation. – Charlie Clark Jul 05 '16 at 17:05
  • http://stackoverflow.com/questions/38208389/how-to-stream-in-and-manipulate-a-large-data-file-in-python – dot.Py Jul 05 '16 at 17:15
  • Possible duplicate of [Is openpyxl the fastest package to use to read big xlsx files?](http://stackoverflow.com/questions/23796587/is-openpyxl-the-fastest-package-to-use-to-read-big-xlsx-files) – Charlie Clark Jul 06 '16 at 06:30

3 Answers


Use openpyxl's read-only mode to do this.

You'll be able to work with the relevant worksheet instantly.
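
For example, a minimal sketch using read-only mode (the filename and sheet name are placeholders taken from the question; adjust them for your file):

from openpyxl import load_workbook

# read_only=True streams the worksheet XML instead of building the whole workbook in memory
wb = load_workbook('xlfile.xlsx', read_only=True)
ws = wb['Sheet Name']

# only the rows you iterate are parsed, so stop after the first 10
for row in ws.iter_rows(min_row=1, max_row=10):
    print([cell.value for cell in row])

# read-only workbooks keep the file handle open, so close it explicitly
wb.close()

Because only the requested rows are parsed (apart from the shared-strings table, which is loaded once), this should take seconds rather than minutes on a file of that size.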

Charlie Clark

Here it is, I found a solution: the fastest way to read an xlsx sheet.

A 56 MB file with over 500k rows and 4 sheets took 6 seconds to process.

import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")

# xl/workbook.xml lists the sheets; map each sheet name to its worksheet XML path
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# stream the chosen sheet's XML straight out of the zip archive
# note: each "row" here is a raw line of XML, not parsed cell values
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # do whatever you want with your data

file.close()

Enjoy and happy coding.

Amine
  • This does not "read" the file at all; it just iterates over the XML. When reading Excel files, most of the time is spent converting the XML into the relevant Python structures. – Charlie Clark Jul 06 '16 at 06:27

The load time you're experiencing is directly related to the I/O speed of your memory chip.

When pandas loads an Excel file, it makes several copies of the data in memory, since the file structure isn't serialized (xlsx is a zipped XML format that has to be parsed).

In terms of a solution, I'd suggest the following workarounds:

  • load your Excel file through a virtual machine with specialized hardware (here's what AWS has to offer)
  • save your file to a CSV format for local use
  • for even better performance, use an optimized columnar format such as Parquet (see the sketch after this list)
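
For instance, a minimal, hypothetical sketch of that conversion with pandas (the filename and sheet name are placeholders from the question; to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

# pay the slow xlsx parse once
df = pd.read_excel('xlfile.xlsx', sheet_name='Sheet Name')

# save to formats that are much cheaper to reload
df.to_csv('xlfile.csv', index=False)
df.to_parquet('xlfile.parquet')

# subsequent reads skip the Excel parsing entirely
df = pd.read_parquet('xlfile.parquet')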

For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python

Yaakov Bressler