0

I have an excel file with over 100 sheets. I need to import into pandas and create a dataframe. The problem is some columns headings contain spaces so I get a dataframe with some duplicate columns.

Is it possible to strip a space in columns in OrderedDict? I know I can use str.strip() in the dataframe but can't find anything for OrderedDict.

I have the following structure in excel:

sheet TEST:
   'ID1' 'ID2'  'ID3'  'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST2:
   'ID1' 'ID2'  'ID3 ' 'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST3:
   'ID1' 'ID2'  'ID3' 'ID4 '
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

At the moment I do the following:

df = pd.read_excel (File location, sheet_name = ['TEST', 'TEST2', 'TEST3'])

df = pd.concat(df, axis=0, sort=False)
df = df.reset_index()

This is what I get:

   'ID1'  'ID2'  'ID3'  'ID4' 'ID3 ' 'ID4 '
1   A     2016    val    val   NaN    NaN
2   B     2017    val    val   NaN    NaN
3   C     2018    val    val   NaN    NaN
4   A     2016    NaN    val   val    NaN
5   B     2017    NaN    val   val    NaN
6   C     2018    NaN    val   val    NaN
7   A     2016    val    NaN   NaN    val
8   B     2017    val    NaN   NaN    val
9   C     2018    val    NaN   NaN    val

This is what I need:

   'ID1'  'ID2'  'ID3'  'ID4' 
1   A     2016    val    val   
2   B     2017    val    val   
3   C     2018    val    val   
4   A     2016    val    val   
5   B     2017    val    val   
6   C     2018    val    val   
7   A     2016    val    val   
8   B     2017    val    val   
9   C     2018    val    val   

Many thanks

Matteo
  • 165
  • 1
  • 11
  • it is easier if you separate the sheet into different variable, like df1 = sheet1, df2 = sheet2, df3 = sheet3 – Jovan Sep 19 '19 at 12:11

2 Answers2

3

If you are reading in multiple sheets, and your workbook is large, it's more efficient to use pd.ExcelFile to create an ExcelFile object, then access the sheets individually. It also makes the column name formatting very straight-forward:

import pandas as pd

xls = pd.ExcelFile(filepath)

dfs = []
for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet)
    df.columns = df.columns.str.strip()
    dfs.append(df)

df = pd.concat(dfs)
DanTan
  • 660
  • 7
  • 17
  • great answer, I routinely do the same for very large workbooks with n number of sheets. Didn't know about the performance difference, have you tested this or is there anything you can point me towards for my education? – Umar.H Sep 19 '19 at 12:38
  • 1
    Yep, just added a link to a question discussing this issue: https://stackoverflow.com/q/26521266/1186342 – DanTan Sep 19 '19 at 12:41
0

Simply define separate, to make it easier:

df1 = pd.read_excel (File location TEST, sheet_name = 'TEST')
df2 = pd.read_excel (File location TEST2, sheet_name = 'TEST2')
df3 = pd.read_excel (File location TEST3, sheet_name = 'TEST3')

and then just concat it like this:

df = pd.concat([df1,df2,df3], axis=0)
df = df.reset_index()

Hope it helps :)

EDIT: If you wanted to apply it with loop, just do the following: assuming that you have 150 sheets, and the name of your sheet is iterable like "TEST" "TEST2" "TEST3".."TEST150"

listdf = []
name=""
for x in range(150):
    if x==0: name = "TEST"
    elif x>0: name= "TEST" + str(x+1)
    df = pd.read_excel (File location TEST, sheet_name = name)
    listdf.append(df)

newbigdf = pd.concat(listdf, axis=0)
Jovan
  • 763
  • 7
  • 26