1

I have a series of large (and poorly formatted) excel spreadsheets that I am trying to process with pandas. Each excel file contains 50-60 sheets, and I am only interested in a subset of the sheets, within each file.

I have tried to read the entire spreadsheet as an pd.ExcelFile object, so I can use the sheet_names attribute to parse particular sheets (and I don't know the names of each sheet ahead of time). This works - but seems exceptionally slow (close to a minute for each ~30mb excel file).

I can only assume this is because each sheet is being parsed as the pd.ExcelFile object is being initialised (...could be wrong?). If so, is there a way to prevent this behaviour? - I really only want to get the sheet names, and then parse the specific sheets from there.

Thanks in advance!

djmac
  • 827
  • 5
  • 11
  • 27
  • 1
    Maybe this topic could help you: https://stackoverflow.com/questions/12250024/how-to-obtain-sheet-names-from-xls-files-without-loading-the-whole-file – Aleksandr Iurkin Feb 20 '20 at 05:32

2 Answers2

1

to best of my knowledge pandas uses xlrd or similar engine to open and parse the excel file. xlrd is the default engine. When you open an excel file using xlrd it defaults to loading all sheets. Thus pandas presumably does as well. You might have better luck opening the excel files using xlrd, setting on_demand kwarg to True, and then defining the df after pulling in data using xlrd.

born_naked
  • 748
  • 9
  • 19
  • Thanks - and as per @aleksandr-iurkin comment above, see related question here: https://stackoverflow.com/questions/12250024/how-to-obtain-sheet-names-from-xls-files-without-loading-the-whole-file – djmac Feb 20 '20 at 10:10
0

Excel worksheets typically have lots and lots of formatting, all of which needs to be consumed and interpreted when an Excel file is opened. Can you parse out the specific sheets that you need? Do you know those in advance? If so, you could split multiple Excel files, each with multiple sheets, into separate files, and focus ONLY on those objects. Try the code below and see if it helps you get to where you need to be.

import os
import xlrd
from xlutils.copy import copy
import xlwt

path = 'C:\\path_to_Multiple_Excel_Files\\'
targetdir = ('C:\\path_to_out_files\\') #where you want your new files

if not os.path.exists(targetdir): #makes your new directory
    os.makedirs(targetdir)

for root,dir,files in os.walk(path, topdown=False): #all the files you want to split
    xlsfiles=[f for f in files] #can add selection condition here

for f in xlsfiles:
    wb = xlrd.open_workbook(os.path.join(root, f), on_demand=True)
    for sheet in wb.sheets(): #cycles through each sheet in each workbook
        newwb = copy(wb) #makes a temp copy of that book
        newwb._Workbook__worksheets = [ worksheet for worksheet in newwb._Workbook__worksheets if worksheet.name == sheet.name ]
        #brute force, but strips away all other sheets apart from the sheet being looked at
        newwb.save(targetdir + f.strip(".xls") + sheet.name + ".xls") 
        #saves each sheet as the original file name plus the sheet name
ASH
  • 20,759
  • 19
  • 87
  • 200