4

What are the Python3 options to efficiently (performance and memory) extract sheet names and for a given sheet, and also column names from a very large .xlsx file?

I've tried using pandas:

For sheet names using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    return xl.sheet_names

For column names using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    df = xl.parse(sheetname, nrows=2, **kwargs)
    df.columns

For column names using pd.read_excel with and without nrows (>v23):

    df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
    df.columns

However, both pd.ExcelFile and and pd.read_excel seem to read the entire .xlsx in memory and are therefore slow.

Thanks a lot!

elke
  • 1,220
  • 2
  • 12
  • 24
  • 2
    Don't have anything handy to test, but how does `dfs = pd.read_excel(filename, sheet_name=None, nrows=0)` perform? You should get a dictionary with sheet names as keys and an empty DataFrame as its values... – Jon Clements Aug 10 '18 at 09:36

4 Answers4

4

Here is the easiest way I can share with you:

# read the sheet file
import pandas as pd
my_sheets = pd.ExcelFile('sheet_filename.xlsx')
my_sheets.sheet_names
Jade Cacho
  • 691
  • 6
  • 9
1

According to this SO question, reading excel files in chunks is not supported (see this issue on github), and using nrows will always read all the file into memory first.

Possible solutions:

  • Convert the sheet to csv, and read that in chunks.
  • Use something other than pandas. See this page for a list of alternative libraries.
Qusai Alothman
  • 1,982
  • 9
  • 23
1

I think this would help the need

from openpyxl import load_workbook

workbook = load_workbook(filename, read_only=True)

data = {}   #for storing the value of sheet with their respective columns

for sheet in worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        data[sheet.title] = value #value would be a tuple with headings of each column
Jis Mathew
  • 175
  • 1
  • 7
0

This program lists all the sheets in the excel. Pandas is used here.

import pandas as pd
with pd.ExcelFile('yourfile.xlsx') as xlsx :
    sh=xlsx.sheet_names
print("This workbook has the following sheets : ",sh)