
I didn't find any related thread.

I have an Excel file that is not so big in size (100 MB) but quite wide (930 columns for 35k rows), and that is my problem. Excel opens this file in a second, but pandas takes at least 10-20 minutes on my computer. I have tried the following:

  • not inferring types, by passing the dtype parameter
  • limiting columns, by using usecols
  • iterating over rows, by using nrows and skiprows in a loop (sketched after the code below)

I cannot convert this Excel file into a CSV.

This is my code so far:

import pandas as pd
df = pd.read_excel("rei_2018/REI_2018.xlsx", engine="openpyxl", dtype=str, usecols=['H11'], nrows=200)
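
For reference, the nrows/skiprows loop (third bullet above) was along these lines; the chunk size is illustrative, and each pass re-opens and re-parses the whole workbook, which is probably why it does not help:

import pandas as pd

chunk_size = 200
offset = 0
frames = []

while True:
    # skiprows skips the data rows already read but keeps the header (row 0)
    chunk = pd.read_excel(
        "rei_2018/REI_2018.xlsx",
        engine="openpyxl",
        dtype=str,
        usecols=['H11'],
        skiprows=range(1, offset + 1),
        nrows=chunk_size,
    )
    if chunk.empty:
        break
    frames.append(chunk)
    offset += chunk_size

df = pd.concat(frames, ignore_index=True)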

edit 1:

  • pandas versions (Windows 64-bit with Anaconda, 8 GB RAM, Intel i5-8265U), from pd.show_versions():

INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.7.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.5
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.3.5
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fsspec           : 0.6.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.13.0
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0

Pdeuxa
  • Side note: "not so big Excel file (100 MB)" - that's a *massive* Excel file (and a red flag in and of itself). – BigBen Jun 22 '21 at 16:19
  • Indeed, remember [Excel is not a database](https://www.google.com/search?q=excel+is+not+a+database). – Parfait Jun 22 '21 at 17:11
  • Of course, I am just a user of this file, whose data are open-sourced. Thus, I cannot choose the file format of the extract from their database. My issue is not with the size; I have the same issue with a 1KB Excel file with a lot of columns. – Pdeuxa Jun 22 '21 at 17:16
  • This performance is interesting. Please describe your environment (OS, virtualenv, RAM, core speed, 32-bit/64-bit, etc.). In fact, edit post with output of `pd.show_versions`. Show some sample data (or generate it for reproducibility). If many columns is an issue, this could be a ticket for [pandas team](https://github.com/pandas-dev/pandas/issues). – Parfait Jun 22 '21 at 17:23
  • I'm not sure you meant 1KB Excel... maybe 1MB. Either way... sounds like Excel is being (ab)used as a database. – BigBen Jun 22 '21 at 17:45
  • I 100% agree with that, and I would prefer dealing with a much larger CSV file. I don't understand why, nowadays, this kind of data is exposed in an Excel format. Concerning the performance, it's kind of strange: even when a specific column is selected (usecols), it seems that all rows are still parsed, and the position of the column in the header is not used. I will try using openpyxl alone to see if I get better performance. – Pdeuxa Jun 22 '21 at 17:59

2 Answers


Consider an ODBC connection to the Excel workbook to query the worksheet like a database table with pandas.read_sql. The ODBC driver below installs with most Windows MS Office installations. The SQL query retrieves one column, H11, and the first 200 rows of a worksheet named mySheet.

import pyodbc
import pandas as pd

strfile = "/path/to/workbook.xlsx"

# Only the DBQ part is interpolated (f-string), so the braces
# around the driver name stay literal. The Excel ODBC driver does
# not support transactions, hence autocommit=True.
conn = pyodbc.connect(
    r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'
    f'DBQ={strfile};',
    autocommit=True
)

# Worksheets are addressed as [SheetName$]
df = pd.read_sql("SELECT TOP 200 [H11] FROM [mySheet$]", conn)

# ALTERNATIVELY, ASSUMING H11 COLUMN IS "H" COLUMN IN SPREADSHEET
df = pd.read_sql("SELECT [H11] FROM [mySheet$H1:H200]", conn)

conn.close()
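
The $ suffix addresses a worksheet like a table ([mySheet$]), and appending a cell range ([mySheet$H1:H200]) limits the query to just that block of cells rather than the full 930-column sheet. The autocommit=True is there because the Excel ODBC driver does not support transactions, so committing would otherwise raise an error.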
Parfait

Without converting your Excel file to a CSV, I think you're somewhat stuck, given what is said in this post.

That being said, it may be worth looking into Excel VBA (Visual Basic for Applications) macros. It's pretty quick to pick up and reads similarly to a Python script. I used it a few years back when I had a similar problem (I needed to read from many Excel files and couldn't use Python).
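
If you need to stay in Python, the openpyxl-only route the asker floated in the comments can also be sketched; a minimal example, where the sheet name and the assumption that the H11 data sits in spreadsheet column H are both placeholders:

from openpyxl import load_workbook

# read_only=True streams rows from disk instead of building the full in-memory model
wb = load_workbook("rei_2018/REI_2018.xlsx", read_only=True)
ws = wb["mySheet"]  # assumed sheet name

values = []
# min_row=2 skips the header; min_col=8/max_col=8 is spreadsheet column H (assumed)
for row in ws.iter_rows(min_row=2, max_row=201, min_col=8, max_col=8, values_only=True):
    values.append(row[0])

wb.close()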