
Reading data (just 20,000 numbers) from an .xlsx file takes forever:

import pandas as pd
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse('Sheet1', index_col=None, header=None)

takes about 9 seconds.

If I save the same data in CSV format, reading it takes ~25 ms:

import pandas as pd
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col=None, header=None)

Is this an issue with openpyxl, or am I missing something? Are there any alternatives?
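The CSV timing above can be reproduced with a sketch like the following (the 20,000 numbers are generated in memory here as a hypothetical stand-in for myfile.csv):

```python
import time
from io import StringIO

import pandas as pd

# Hypothetical stand-in for "myfile.csv": 20,000 numbers, one per line.
csv_text = "\n".join(str(i) for i in range(20000))

start = time.perf_counter()
data = pd.read_csv(StringIO(csv_text), index_col=None, header=None)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"read_csv: {elapsed_ms:.1f} ms, shape={data.shape}")
```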

sashkello
    There's lots more overhead in XLSX. For one, it must be uncompressed before parsing. Another reason is that it is XML, which must be parsed. 9 seconds does seem pretty high for this, but there are good reasons why it is many times slower. – bbayles Apr 24 '13 at 03:31
  • Yes, I understand, but 9 seconds seem to be somewhat too much... – sashkello Apr 24 '13 at 03:43
  • 5
    read_csv is a highly optimized piece of c code; excel is read via a pure python library – Jeff Apr 24 '13 at 03:44

1 Answer


xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.

The current stable version of pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
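In later pandas releases the reader library can be named explicitly through read_excel's engine argument, so you can test one engine against another yourself. A sketch, assuming the openpyxl package is installed (small.xlsx is a throwaway file created just for the demonstration):

```python
import pandas as pd

# Write a small throwaway workbook, then read it back with the reader
# engine named explicitly (assumes openpyxl is installed).
pd.DataFrame({"value": range(100)}).to_excel("small.xlsx", index=False)
data = pd.read_excel("small.xlsx", engine="openpyxl")
print(data.shape)
```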

Matti John
  • 19,329
  • 7
  • 41
  • 39