1

I wondered if anyone could help. I am combining multiple excel files that have between 1-3 sheets. I want to combine these sheets into 3 dataframes and have done so using the below code:

all_workbook1 = pd.DataFrame()
all_workbook2 = pd.DataFrame()
all_workbook3 = pd.DataFrame()

for f in glob.glob("*.xlsx"):
    dfworkbook1 = pd.read_excel(f, sheet_name="sheet1", usecols="B:AO")
    dfworkbook1["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook1 = all_workbook1.append(dfworkbook1,ignore_index=True)

    dfworkbook2 = pd.read_excel(f, sheet_name="sheet2", usecols="B:AO")
    dfworkbook2["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook2 = all_workbook2.append(dfworkbook2,ignore_index=True)

    dfworkbook3 = pd.read_excel(f, sheet_name="sheet3", usecols="B:AO")
    dfworkbook3["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook3 = all_workbook3.append(dfworkbook3,ignore_index=True)

When running this I can get the below error:

xlrd.biffh.XLRDError: No sheet named <'sheet3'>

I believe this is due to the fact that not all of my files have 'sheet3'. What is the best process to avoid this? I have tried to add code to the start that runs over the files and adds the missing sheets as a blank sheet but have been struggling with this.

Any help would be great. Thanks, Dan

Dan
  • 13
  • 2
  • [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) Instead use a list of data frames and bind together **once** with `pd.concat`. – Parfait Nov 26 '20 at 21:06

1 Answers1

0

Consider using a defined method that runs try/except to account for potentially missing sheets. Then call method within several list comprehensions for corresponding sheet data frame lists that are ultimately concatenated together:

def read_xl_data(file, sh):
   try:
      df = (pd.read_excel(file, sheet_name=sh, usecols="B:AO") 
              .assign(Filename = f"[{os.path.basename(file)}]"))
   except:
      df = pd.DataFrame()

   return df


# LIST COMPREHENSIONS TO RETRIEVE SPECIFIC SHEETS
sheet1_dfs = [read_xl_data(f, "sheet1") for f in glob.glob("*.xlsx")]
sheet2_dfs = [read_xl_data(f, "sheet2") for f in glob.glob("*.xlsx")]
sheet3_dfs = [read_xl_data(f, "sheet3") for f in glob.glob("*.xlsx")]

# CONCAT CORRESPONDING SHEET DFS TOGETHER
all_workbook1 = pd.concat(sheet_1_dfs)
all_workbook2 = pd.concat(sheet_2_dfs) 
all_workbook3 = pd.concat(sheet_3_dfs)
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Amazing. Thank you for the help with this! The only part I can't get to work is creating the column "Filename". If I use the code as it is, it appears the dataframes are empty. I then tried to do it in the same method as before by doing: all_workbook1["Filename"] = = "[" + os.path.basename(f) + "]", this then advises that 'f' is not found. Any help would be great. Thanks again – Dan Nov 27 '20 at 13:27
  • Whoops! Please replace `f` with `file` in that F-string (which assumes you use Python 3.6+). – Parfait Nov 27 '20 at 17:19
  • Amazing! I really appreciate the assistance with this! – Dan Nov 30 '20 at 10:02