0

I have an Excel workbook with 8 sheets in it. They all follow the same column header structure. The only difference is, the first sheet starts at row 1, but the rest of the sheets start at row 4.

I am trying to run a command like this, but this is giving me the wrong data - and I recognize that because I wrote sheet_name=None this will give me issues as the sheets start at different rows:

df = pd.concat(pd.read_excel(xlsfile, sheet_name=None, skiprows=4), sort=True)

My next attempt was to:

frames = []
df = pd.read_excel(xlsfile, sheet_name='Questionnaire')
for sheet in TREND_SHEETS:
    tmp = pd.read_excel(xlsfile, sheet_name=sheet, skiprows=4)
    # append tmp dynamically to frames, then use concat frames at the end.. ugly
    df.append(tmp, sort=False)

return df

Note, Questionnaire is the first sheet in the Excel workbook. I know the logic here is off, and I do not want to create dynamic variables holding the 'tmp', appending it to a list, and then concatenating the frames.

How can I go about solving this, so that I achieve a dataframe which incorporates all the sheet data?

sgerbhctim
  • 3,420
  • 7
  • 38
  • 60

2 Answers2

3

What I would do is have a config file, like a Python dictionary with the sheetnames as keys, and the values can be the number_of_rows to skip.

Thanks to @parfait for proving a better solution, it is best to concatenate outside of the for loop as its more memory efficient. What you can do it append the dfs to a list within the for loop, then concatenate outside.

import pandas as pd
sheets = {
    'Sheet1': 1,
    'Sheet2': 4,
    'Sheet3': 4,
    'Sheet4': 4
}

list_df = list()
for k, v in sheets.items():
    tmp = pd.read_excel(xlsfile, sheetname=k, skiprows=v)
    list_df.append(tmp)


final_df = pd.concat(list_df, ignore_index=True)
Jenobi
  • 368
  • 4
  • 12
  • This helps a bunch. Let me apply it to my current situation and I will get back to you. Thank you! – sgerbhctim Feb 09 '19 at 18:17
  • Did not produce the full data on this.. the issue is, you are redefining `tmp` each time, and `df_final` is produced within that for loop, so `df_final` does not hold all that data properly.. – sgerbhctim Feb 09 '19 at 18:27
  • 1
    [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Feb 09 '19 at 19:42
3

Consider a list comprehension to build a list of data frames for concatenating once outside the loop. To borrow @Jenobi's dictionary approach:

sheets = {'sheet1': 1, 'sheet2': 4, 'sheet3': 4, 'sheet4': 4}

df_list = [pd.read_excel(xlsfile, sheetname=k, skiprows=v) \
              for k,v in sheets.items()]

final_df = pd.concat(df_list, ignore_index=True)
Parfait
  • 104,375
  • 17
  • 94
  • 125