
I have an Excel file that I guess is quite huge for an Excel file (200 MB); it has around 20 sheets full of information.

My question is whether it is normal for the following simple action to take almost 5 minutes to execute. I am wondering if I am doing it the correct way.

import xlrd

def processExcel(excelFile):
    excelData = xlrd.open_workbook(excelFile)
    sheets = excelData.sheet_names()
    print(sheets)

As you can see, in the first step I am just trying to get the sheet names, and even that simple thing takes 5 minutes... is that possible?

codeKiller

1 Answer


Yes, it's absolutely possible. That is indeed a lot of data to be in an Excel file. By default, xlrd loads the entire workbook into memory. If your workbook is a .xls file, you can use the on_demand parameter to only open worksheets as they are needed:

import xlrd

def processExcel(excelFile):
    excelData = xlrd.open_workbook(excelFile, on_demand=True)
    sheets = excelData.sheet_names()
    print(sheets)

If you are trying to open a .xlsx file, the on_demand parameter has no effect.
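
For a .xls file, a rough sketch of processing sheets one at a time might look like the following (the per-sheet processing is a placeholder; unload_sheet and release_resources are standard xlrd Book methods that free a loaded sheet and close the file):

import xlrd

def processSheets(excelFile):
    book = xlrd.open_workbook(excelFile, on_demand=True)
    try:
        for name in book.sheet_names():       # cheap: no sheet data loaded yet
            sheet = book.sheet_by_name(name)  # loads only this one sheet
            print(name, sheet.nrows)          # stand-in for real processing
            book.unload_sheet(name)           # free this sheet's memory
    finally:
        book.release_resources()              # close the underlying file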

Update

If you are using Python 3 and reading a .xlsx file, you can try sxl. This is a library that reads data into memory only as needed, so just opening the workbook to retrieve the worksheet names is very quick. And if you just need the first few rows of a worksheet, it can get those rather quickly as well.
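
For example, here is a minimal sketch of grabbing just the first few rows (the filename is a placeholder; itertools.islice simply stops pulling from the lazy ws.rows iterator after five rows):

from itertools import islice
from sxl import Workbook

wb = Workbook('MyBigFile.xlsx')
ws = wb.sheets[1]
for row in islice(ws.rows, 5):  # only the first 5 rows are read from disk
    print(row)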

If you need to read all the data with sxl, you have to iterate over all the rows, which could be even slower than xlrd, but at least will only use up as much memory as you need. For example, the following code will only keep one row in memory at any given time:

from sxl import Workbook

wb = Workbook('MyBigFile.xlsx')
ws = wb.sheets[1]
for row in ws.rows:  # rows are streamed from the file one at a time
    print(row)

However, if you need random access to all the rows to do your processing, you'll have to keep them all in memory:

from sxl import Workbook

wb = Workbook('MyBigFile.xlsx')
ws = wb.sheets[1]
all_rows = list(ws.rows)

In this case, all_rows keeps the entire sheet in memory. If your workbook has multiple sheets, this may still be more efficient than xlrd. But if you need your whole workbook in memory, then you might as well stick to xlrd.
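
For completeness, here is a short sketch of the kind of random access that requires the full list (the indices are purely illustrative):

header = all_rows[0]                 # first row, e.g. column headings
print(len(all_rows))                 # total number of rows in the sheet
print(all_rows[len(all_rows) // 2])  # jump straight to any row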

John Y
    I don't have much experience using [OpenPyXL](https://pypi.python.org/pypi/openpyxl) but its [read-only mode](http://openpyxl.readthedocs.org/en/latest/optimized.html) is supposed to help with very large files. It's worth a try. – John Y Jul 03 '15 at 12:26
  • Thanks! Please add `excelData.release_resources()`, otherwise the file remains open. See https://stackoverflow.com/questions/33241837/python-xlrd-book-how-to-close-the-files – citynorman Nov 22 '17 at 16:24
  • Can you please provide an example for retrieving sheet names? – MehmedB Jul 16 '19 at 09:21