1

I have around 20++ xlsx files, inside each xlsx files might contain different numbers of worksheets. But thank god, all the columns are the some in all worksheets and all xlsx files. By referring to here", i got some idea. I have been trying a few ways to import and append all excel files (all worksheet) into a single dataframe (around 4 million rows of records).

Note: i did check here" as well, but it only include file level, mine consits file and down to worksheet level.

I have tried below code

# import all necessary package
import pandas as pd
from pathlib import Path
import glob
import sys

# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")

for file in source_dataset_list:
#xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    out_df = pd.DataFrame() ## create empty output dataframe

    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))
        sys.stdout.flush() ## # View the excel files sheet names
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheetname=sheet)
        out_df = out_df.append(df)  ## This will append rows of one dataframe to another(just like your expected output)

Question:

My approach is like first read the every single excel file and get a list of sheets inside it, then load the sheets and append all sheets. The looping seems not very efficient expecially when datasize increase for every append.

Is there any other efficient way to import and append all sheets from multiple excel files?

Community
  • 1
  • 1
yc.koong
  • 175
  • 2
  • 10
  • Hi @run-out thanks for comment, i have checked the link, and i have read that before, it only append up to file level but not worksheet level. – yc.koong Apr 16 '19 at 04:37

3 Answers3

4

Use sheet_name=None in read_excel for return orderdict of DataFrames created from all sheetnames, then join together by concat and last DataFrame.append to final DataFrame:

out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf,ignore_index=True)

Another solution:

cdf = [pd.read_excel(excel_names, sheet_name=None).values() 
            for excel_names in source_dataset_list]

out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Hi @jezrael thanks for comment, i have tried using real dataset (12 excel files consists of 65 worksheets in total). The runtime is for both code are `1092.280` & `912.008` seconds. **Your second solution seems more efficient** :) Thanks! – yc.koong Apr 16 '19 at 08:07
  • 1
    and i found `pd.read_excel(f, sheet_name=None)` won't work, the correct approach should be `pd.read_excel(f, sheetname=None)`. No underscore for **sheetname** – yc.koong Apr 16 '19 at 08:10
  • @yc.koong - hmmm, it depends of version of pandas, in last one is used `sheet_name`, in previous `sheetname` – jezrael Apr 16 '19 at 08:16
  • Hi, thanks for this solution. upvoted. but how does it read all sheets in a excel file when the sheet_name is `None`. Shouldn't it read only the 1st excel sheet? – The Great Mar 18 '22 at 04:38
  • Based on this https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook , I thought it will read only 1st sheet if you don't specify the sheet name. At least that;s what the solution says – The Great Mar 18 '22 at 04:42
2

If i understand your problem correctly, set sheet_name=None in pd.read_excel does the trick.

import os
import pandas as pd

path = "C:/Users/aaa/Desktop/Sample_dataset/"

dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]

df = pd.concat(dfs)
pythonjokeun
  • 431
  • 2
  • 8
  • Hi @pythonjokeun, thanks for comment, i tried edit your code to meet my case here, changed `if x.endswith(".xlsx") or x.endswith(".xls")` to `if "Sales transaction" in x`. And i tried to run on real dataset (12 excel files consists of 65 worksheets in total, and around 4 million row of records), total runtime is `1040.364` seconds. :) – yc.koong Apr 16 '19 at 08:18
-1

I have a pretty straight forward solution if you want to read all the sheets.

import pandas as pd
df = pd.concat(pd.read_excel(path+file_name, sheet_name=None), 
               ignore_index=True)
Dharman
  • 30,962
  • 25
  • 85
  • 135
Reshma2k
  • 139
  • 1
  • 6