0

I want to read all sheets in a a excel and make it as one data frame but one catch here is that only in the first sheet the headers start from 4th line and other sheets i want to eliminate only the header since we have it from first sheet.

I tried using

df = pd.concat(pd.read_excel('abc.xlsx', sheet_name=None,header = 4), ignore_index=True),

I get some random columns as output.

Ideally i want to

sheets_dict = pd.read_excel('abc.xlsx', sheet_name=None)

full_table = pd.DataFrame()
for name, sheet in sheets_dict.items():
if name = "Sheet1":
    skip first four lines
else 
    skip only 1 line.

Can someone help me how to do this in python for pandas.

Ashwin
  • 17
  • 5
  • see https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook – Anurag Dabas Aug 02 '21 at 11:25
  • Does this answer your question? [Using Pandas to pd.read\_excel() for multiple worksheets of the same workbook](https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook) – Mohammad Aug 02 '21 at 11:28

2 Answers2

3

Example/Given: enter image description here

Try:

import pandas as pd

ex_file = pd.ExcelFile('multiSheet.xlsx')

dfs = []

print(ex_file.sheet_names) # ['Sheet1', 'Sheet2', 'Sheet3']

for sheet in ex_file.sheet_names:
    if sheet == 'Sheet1':
        df = ex_file.parse(sheet, skiprows=4, parse_dates=True)
    else:
        df = ex_file.parse(sheet, parse_dates=True)
    dfs.append(df)

df = pd.concat(dfs)

df = df.reset_index(drop=True)

print(df)

Outputs:

    ID  Name    Date
0   1   John    2019-05-01
1   2   Jane    2019-05-02
2   3   Joe     2019-05-03
3   4   Sarah   2019-05-04
4   5   Kenny   2019-05-05
5   6   Jenny   2019-05-06
6   7   Kate    2019-05-07
7   8   Sue     2019-05-08
8   9   Frank   2019-05-09
9   10  Bob     2019-05-10
10  11  Beth    2019-05-11
11  12  Maggie  2019-05-12
12  13  Gerry   2019-05-13
13  14  Jerry   2019-05-14
14  15  Fiona   2019-05-15
MDR
  • 2,610
  • 1
  • 8
  • 18
  • To avoid hard-coding "Sheet1" in your script, you could declare `df = None` before the loop and then check `if df is None: ...` instead of `if sheet == "Sheet1": ...`. – hlzl Aug 02 '21 at 12:29
  • True. 'Sheet1' in this case happens to be the first sheet. If the nth sheet needs the first few rows skipping the above can be changed to match the correct sheet name whatever it is and would still work (hopefully! :o). – MDR Aug 02 '21 at 12:41
  • Thanks a Lot, i think this should solve my issue – Ashwin Aug 04 '21 at 10:31
1

I'm normally the one working with duplicate headers already in the file. In that case you could do something similar whereby consolidating everything into the data frame and dealing with the headers after-the-fact. The solution below would be suitable for headers that are the same. I've nominated one field name to indicate to me that it is a header - in this case labelled 'the_repeated_header_field_name' ; alternatively you may choose to evaluate more than one field if required.

#remove repeated heading columns arising from data 'appends' 
df = df[~df['the_repeated_header_field_name'].str.contains('the_repeated_header_field_name', na=False, regex=False)]
Will_d0g
  • 44
  • 4