concatenate all sheets in the excel file, some of which have a different skiprows criteria

Question

I have an Excel workbook with 8 sheets in it. They all follow the same column header structure. The only difference is, the first sheet starts at row 1, but the rest of the sheets start at row 4.

I am trying to run a command like this, but this is giving me the wrong data - and I recognize that because I wrote sheet_name=None this will give me issues as the sheets start at different rows:

df = pd.concat(pd.read_excel(xlsfile, sheet_name=None, skiprows=4), sort=True)

My next attempt was to:

frames = []
df = pd.read_excel(xlsfile, sheet_name='Questionnaire')
for sheet in TREND_SHEETS:
    tmp = pd.read_excel(xlsfile, sheet_name=sheet, skiprows=4)
    # append tmp dynamically to frames, then use concat frames at the end.. ugly
    df.append(tmp, sort=False)

return df

Note, Questionnaire is the first sheet in the Excel workbook. I know the logic here is off, and I do not want to create dynamic variables holding the 'tmp', appending it to a list, and then concatenating the frames.

How can I go about solving this, so that I achieve a dataframe which incorporates all the sheet data?

Is the problem that you need the skiprows to be different for different sheets? — foobarbaz, Feb 09 '19 at 18:08
@cdc200 I need skiprows to be 4 for all but 1 sheet.. That one sheet does not need the skiprows argument — sgerbhctim, Feb 09 '19 at 18:08
that is your question though, correct? Or am I misunderstanding the problem? — foobarbaz, Feb 09 '19 at 18:15
The problem is, I want to concatenate all sheets in the excel file, some of which have a different skiprows criteria than others — sgerbhctim, Feb 09 '19 at 18:17

score 3 · Answer 1 · edited Jun 15 '22 at 04:17

3

What I would do is have a config file, like a Python dictionary with the sheetnames as keys, and the values can be the number_of_rows to skip.

Thanks to @parfait for proving a better solution, it is best to concatenate outside of the for loop as its more memory efficient. What you can do it append the dfs to a list within the for loop, then concatenate outside.

import pandas as pd
sheets = {
    'Sheet1': 1,
    'Sheet2': 4,
    'Sheet3': 4,
    'Sheet4': 4
}

list_df = list()
for k, v in sheets.items():
    tmp = pd.read_excel(xlsfile, sheetname=k, skiprows=v)
    list_df.append(tmp)


final_df = pd.concat(list_df, ignore_index=True)

edited Jun 15 '22 at 04:17

Oleg Valter is with Ukraine

9,619
8
37
59

answered Feb 09 '19 at 18:15

Jenobi

368
4
12

This helps a bunch. Let me apply it to my current situation and I will get back to you. Thank you! – sgerbhctim Feb 09 '19 at 18:17
Did not produce the full data on this.. the issue is, you are redefining `tmp` each time, and `df_final` is produced within that for loop, so `df_final` does not hold all that data properly.. – sgerbhctim Feb 09 '19 at 18:27
1

[Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Feb 09 '19 at 19:42

score 3 · Accepted Answer · edited Jun 15 '22 at 04:14

3

Consider a list comprehension to build a list of data frames for concatenating once outside the loop. To borrow @Jenobi's dictionary approach:

sheets = {'sheet1': 1, 'sheet2': 4, 'sheet3': 4, 'sheet4': 4}

df_list = [pd.read_excel(xlsfile, sheetname=k, skiprows=v) \
              for k,v in sheets.items()]

final_df = pd.concat(df_list, ignore_index=True)

edited Jun 15 '22 at 04:14

Oleg Valter is with Ukraine

9,619
8
37
59

answered Feb 09 '19 at 19:52

Parfait

104,375
17
94
125

Beautiful. Thank you. – sgerbhctim Feb 09 '19 at 20:25

concatenate all sheets in the excel file, some of which have a different skiprows criteria

2 Answers2