4

I am trying to process an excel file with ~600k rows and ~10 columns.

I want to execute some program line by line (or row by row) as if it is an iterator (like txt/csv files). However, if I use xlrd or pandas to read the excel file, it takes ~2-3min for opening the file.

I wonder if it is possible to read the excel file line by line efficiently so that, for example, I can verify my program with the first 10 rows without waiting for a long time for every trial.

Edit: Thank you for suggesting to convert the excel file to a csv before processing. However, I have to create an one-click program to the user. So, I still hope to find out a way to directly read the excel file efficiently, if possible.

Mayan
  • 492
  • 4
  • 11
  • 1
    i am not familiar with xlrd, - [openpyxl](https://openpyxl.readthedocs.io/en/stable/optimized.html) has a read only mode that could help with ur usecase. have a look at it – sammywemmy Apr 01 '20 at 08:32
  • you can declare your dtypes before hand and the format of any datetime columns, this would help. you could also use `nrows` to read in 10k rows to test validity – Umar.H Apr 01 '20 at 08:41
  • Does this answer your question? [Faster way to read Excel files to pandas dataframe](https://stackoverflow.com/questions/28766133/faster-way-to-read-excel-files-to-pandas-dataframe) – Umar.H Apr 01 '20 at 08:41
  • @Datanovice Thank you for your suggestion. This is also my suggested solution but I have to help create some one-click program for the user. It is not preferable for me to include a vbs script or an additional csv file generated after the execution of the program – Mayan Apr 01 '20 at 08:48
  • @sammywemmy, The speed of openpyxl is impressive. Let me try to apply to the target excel file later. – Mayan Apr 01 '20 at 09:20
  • 1
    @sammywemmy, I tried and your solution meets my need exactly. Thank you – Mayan Apr 02 '20 at 01:15

2 Answers2

1

The pandas read_excel has a nrows parameter that controls the number of lines that will be read. pd.read_excel(your_xslx_file.xlsx, nrows=10)

You can read the number of lines you want, and if you need it to start reading further down the file, it can be controlled with the parameter skiprows.

  • Thanks for the suggestion. Let me have a try later! – Mayan Apr 01 '20 at 08:50
  • Seems that there is no such method(`.read_xlsx`) in pandas, and I tried `pd.read_excel(file, nrows = 10`, the time spent is the same – Mayan Apr 01 '20 at 09:07
  • I've corrected it to read_excel, thank you. Then I suggest that the file be converted to a .csv file and then use the `datatable ` module: `datatable.fread(converted_csv)`. – Ignacio Valenzuela Apr 02 '20 at 10:33
0

Excel files are zip folder with xml file for each worksheet, maybe it's more efficient to open this file has a zip and read your line with a xml library ?