
Reading data (just 20,000 numbers) from an .xlsx file takes forever:

import pandas as pd
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse('Sheet1', index_col=None, header=None)

takes about 9 seconds.

If I save the same data in CSV format, reading it takes ~25 ms:

import pandas as pd
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col=None, header=None)

Is this an issue with openpyxl, or am I missing something? Are there any alternatives?
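The CSV timing above can be reproduced with a sketch like the following (the 20,000 numbers are generated in memory here as a hypothetical stand-in for myfile.csv):

```python
import time
from io import StringIO

import pandas as pd

# Hypothetical stand-in for "myfile.csv": 20,000 numbers, one per line.
csv_text = "\n".join(str(i) for i in range(20000))

start = time.perf_counter()
data = pd.read_csv(StringIO(csv_text), index_col=None, header=None)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"read_csv: {elapsed_ms:.1f} ms, shape={data.shape}")
```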

sashkello
    There's lots more overhead in XLSX. For one, it must be uncompressed before parsing. Another reason is that it is XML, which must be parsed. 9 seconds does seem pretty high for this, but there are good reasons why it is many times slower. – bbayles Apr 24 '13 at 03:31
  • Yes, I understand, but 9 seconds seem to be somewhat too much... – sashkello Apr 24 '13 at 03:43
  • 5
    read_csv is a highly optimized piece of c code; excel is read via a pure python library – Jeff Apr 24 '13 at 03:44

1 Answer


xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.

The current stable version of pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
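In later pandas releases the reader library can be named explicitly through read_excel's engine argument, so you can test one engine against another yourself. A sketch, assuming the openpyxl package is installed (small.xlsx is a throwaway file created just for the demonstration):

```python
import pandas as pd

# Write a small throwaway workbook, then read it back with the reader
# engine named explicitly (assumes openpyxl is installed).
pd.DataFrame({"value": range(100)}).to_excel("small.xlsx", index=False)
data = pd.read_excel("small.xlsx", engine="openpyxl")
print(data.shape)
```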

Matti John
  • 19,329
  • 7
  • 41
  • 39