1

I am working with very big Excel files, which take a long time to be loaded with Pandas in Python. Before processing the data, the user has to select quite a few options related to the data, which only require the names of the each column in each dataset. It is very inconvenient for the user to have to wait sometimes minutes until the data is loaded to be able to select the necessary options and then let the program do the actual processing for another few minutes.

So, my question is: is there a way to load only the data header from an Excel file with Python? In a way I think of it as an alternate version to the "skiprows" parameter in the read_excel Pandas function, where instead of skipping rows in the beginning of the data, I would like to skip rows at the end of the data. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately it didn't help me.

Thank you for the help!

Victor
  • 33
  • 5
  • try `nrows` to read in the first 1-2 rows for the header. The data shouldn't take very long to load though, are the excel workbooks large? – Umar.H Jun 17 '20 at 09:14
  • @Datanovice check the my response to the answer below. And the workbooks I mentioned are really large, mostly 100MB+ – Victor Jun 17 '20 at 12:45

3 Answers3

4

You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:

from datetime import datetime
startTime = datetime.now()
import pandas as pd
import sxl


startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format((datetime.now() - startTime)))


startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows = 5)
print("Time taken with top 5 rows with pandas read excel is {}".format((datetime.now() - startTime)))


startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format((datetime.now() - startTime)))

Pandas read excel loads the whole data in memory, so there is not much of a difference difference in timing. Here are the outputs from the above:

  • Time taken to load whole data with pandas read excel is 0:00:49.174538
  • Time taken with top 5 rows with pandas read excel is 0:00:44.478523
  • Time taken to load top 5 rows using sxl is 0:00:00.671717

I hope this helps!!

K_Raikar
  • 126
  • 5
  • Thanks for the answer, K_Raikar! That is exactly what I needed, the project description on the PyPI page is perfect. Thank you so much! – Victor Jun 17 '20 at 16:01
  • @Victor can we use above code to read csv file size around 100gb. i want to get column header only – kakaji Feb 19 '22 at 16:05
0

You can use 'skipfooter' parameter or 'nrows' parameter in both .xlsx & .csv. However, both cannot be used together.

path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)

which means, 99999 rows will be skipped from footer to top & remaining records from header will load.

path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)

which means, first 5 rows will be shown with header.

Also refer this Stack over flow Question.

Viknesh S K
  • 101
  • 3
  • 9
  • Thanks for the answer, Viknesh! I Unfortunately this does not cut the time pandas takes to open the file, it just loads the whole data and returns the parsed dataframe (the run time of adding one of these parameters is just slightly higher than loading the file without any additional parameters). – Victor Jun 17 '20 at 12:42
0
from dask import dataframe as dd

df= dd.read_csv(“filename”)

Trust me its fast I am reading 800 mb of file

jezza_99
  • 1,074
  • 1
  • 5
  • 17
kakaji
  • 161
  • 9
  • 1
    Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 20 '22 at 02:33