pandas OrderedDict strip space read_excel

Question

I have an excel file with over 100 sheets. I need to import into pandas and create a dataframe. The problem is some columns headings contain spaces so I get a dataframe with some duplicate columns.

Is it possible to strip a space in columns in OrderedDict? I know I can use str.strip() in the dataframe but can't find anything for OrderedDict.

I have the following structure in excel:

sheet TEST:
   'ID1' 'ID2'  'ID3'  'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST2:
   'ID1' 'ID2'  'ID3 ' 'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST3:
   'ID1' 'ID2'  'ID3' 'ID4 '
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

At the moment I do the following:

df = pd.read_excel (File location, sheet_name = ['TEST', 'TEST2', 'TEST3'])

df = pd.concat(df, axis=0, sort=False)
df = df.reset_index()

This is what I get:

   'ID1'  'ID2'  'ID3'  'ID4' 'ID3 ' 'ID4 '
1   A     2016    val    val   NaN    NaN
2   B     2017    val    val   NaN    NaN
3   C     2018    val    val   NaN    NaN
4   A     2016    NaN    val   val    NaN
5   B     2017    NaN    val   val    NaN
6   C     2018    NaN    val   val    NaN
7   A     2016    val    NaN   NaN    val
8   B     2017    val    NaN   NaN    val
9   C     2018    val    NaN   NaN    val

This is what I need:

   'ID1'  'ID2'  'ID3'  'ID4' 
1   A     2016    val    val   
2   B     2017    val    val   
3   C     2018    val    val   
4   A     2016    val    val   
5   B     2017    val    val   
6   C     2018    val    val   
7   A     2016    val    val   
8   B     2017    val    val   
9   C     2018    val    val

Many thanks

it is easier if you separate the sheet into different variable, like df1 = sheet1, df2 = sheet2, df3 = sheet3 — Jovan, Sep 19 '19 at 12:11

DanTan · Accepted Answer · 2019-09-19T12:40:22.297

3

If you are reading in multiple sheets, and your workbook is large, it's more efficient to use pd.ExcelFile to create an ExcelFile object, then access the sheets individually. It also makes the column name formatting very straight-forward:

import pandas as pd

xls = pd.ExcelFile(filepath)

dfs = []
for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet)
    df.columns = df.columns.str.strip()
    dfs.append(df)

df = pd.concat(dfs)

edited Sep 19 '19 at 12:40

answered Sep 19 '19 at 12:36

DanTan

660
7
17

great answer, I routinely do the same for very large workbooks with n number of sheets. Didn't know about the performance difference, have you tested this or is there anything you can point me towards for my education? – Umar.H Sep 19 '19 at 12:38
1

Yep, just added a link to a question discussing this issue: https://stackoverflow.com/q/26521266/1186342 – DanTan Sep 19 '19 at 12:41

Jovan · Answer 2 · 2019-09-19T12:43:31.750

Simply define separate, to make it easier:

df1 = pd.read_excel (File location TEST, sheet_name = 'TEST')
df2 = pd.read_excel (File location TEST2, sheet_name = 'TEST2')
df3 = pd.read_excel (File location TEST3, sheet_name = 'TEST3')

and then just concat it like this:

df = pd.concat([df1,df2,df3], axis=0)
df = df.reset_index()

Hope it helps :)

EDIT: If you wanted to apply it with loop, just do the following: assuming that you have 150 sheets, and the name of your sheet is iterable like "TEST" "TEST2" "TEST3".."TEST150"

listdf = []
name=""
for x in range(150):
    if x==0: name = "TEST"
    elif x>0: name= "TEST" + str(x+1)
    df = pd.read_excel (File location TEST, sheet_name = name)
    listdf.append(df)

newbigdf = pd.concat(listdf, axis=0)

pandas OrderedDict strip space read_excel

2 Answers2