Read Excel file line by line efficiently

Question

I am trying to process an excel file with ~600k rows and ~10 columns.

I want to execute some program line by line (or row by row) as if it is an iterator (like txt/csv files). However, if I use xlrd or pandas to read the excel file, it takes ~2-3min for opening the file.

I wonder if it is possible to read the excel file line by line efficiently so that, for example, I can verify my program with the first 10 rows without waiting for a long time for every trial.

Edit: Thank you for suggesting to convert the excel file to a csv before processing. However, I have to create an one-click program to the user. So, I still hope to find out a way to directly read the excel file efficiently, if possible.

i am not familiar with xlrd, - [openpyxl](https://openpyxl.readthedocs.io/en/stable/optimized.html) has a read only mode that could help with ur usecase. have a look at it — sammywemmy, Apr 01 '20 at 08:32
you can declare your dtypes before hand and the format of any datetime columns, this would help. you could also use `nrows` to read in 10k rows to test validity — Umar.H, Apr 01 '20 at 08:41
Does this answer your question? [Faster way to read Excel files to pandas dataframe](https://stackoverflow.com/questions/28766133/faster-way-to-read-excel-files-to-pandas-dataframe) — Umar.H, Apr 01 '20 at 08:41
@Datanovice Thank you for your suggestion. This is also my suggested solution but I have to help create some one-click program for the user. It is not preferable for me to include a vbs script or an additional csv file generated after the execution of the program — Mayan, Apr 01 '20 at 08:48
@sammywemmy, The speed of openpyxl is impressive. Let me try to apply to the target excel file later. — Mayan, Apr 01 '20 at 09:20
@sammywemmy, I tried and your solution meets my need exactly. Thank you — Mayan, Apr 02 '20 at 01:15

Ignacio Valenzuela · Answer 1 · 2020-04-02T10:29:20.443

1

The pandas read_excel has a nrows parameter that controls the number of lines that will be read. pd.read_excel(your_xslx_file.xlsx, nrows=10)

You can read the number of lines you want, and if you need it to start reading further down the file, it can be controlled with the parameter skiprows.

edited Apr 02 '20 at 10:29

answered Apr 01 '20 at 08:43

Ignacio Valenzuela

137
8

Thanks for the suggestion. Let me have a try later! – Mayan Apr 01 '20 at 08:50
Seems that there is no such method(`.read_xlsx`) in pandas, and I tried `pd.read_excel(file, nrows = 10`, the time spent is the same – Mayan Apr 01 '20 at 09:07
I've corrected it to read_excel, thank you. Then I suggest that the file be converted to a .csv file and then use the `datatable ` module: `datatable.fread(converted_csv)`. – Ignacio Valenzuela Apr 02 '20 at 10:33

score 0 · Answer 2 · answered Apr 01 '20 at 08:33

0

Excel files are zip folder with xml file for each worksheet, maybe it's more efficient to open this file has a zip and read your line with a xml library ?

answered Apr 01 '20 at 08:33

Jacques Herrmann

21
2

Read Excel file line by line efficiently

2 Answers2