0

How can I read multiple large excel files from a folder which is of 1.98 gb, and read all sheets of each and every file and append them into one dataframe. I am new at python and I have tried something like below but it takes more than half an hour -

with warnings.catch_warnings(record = True):
    warnings.simplefilter("always")
    for file_name in os.listdir(file_path):
        if '~$' in file_name:
            continue
        else:
            xls = pd.ExcelFile(os.path.join(file_path,file_name))
            for sheets in xls.sheet_names:
                df = df.append(xls.parse(sheets))
Zeren
  • 27
  • 4

1 Answers1

0

It is perfectly fine to execute code for so long time, because Python is single threaded programming language. This is why I wouldn't recommend you doing (if u decided to make it) in multiple threads because you can encounter OOM error or loose some info during merging them at one time.

To combine each sheet into one dataframe there is already an answer here - python efficient way to append all worksheets in multiple excel into pandas dataframe

Here is a short summary which might help you

out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf,ignore_index=True)
  • thanks for your answer, does chunksize help to reduce time? – Zeren Mar 08 '23 at 17:54
  • @Zeren It might work a bit longer, but this is something you should try by urself. One way or another, by using this method you almost certainly avoid OOM problem – Master_Sniffer Mar 08 '23 at 23:26
  • 1
    got it, I saw the solution and I think for the first answer the second solution is more efficient, how can I read the excel file from a directory and I need to exclude the file which has 'A001' in the file name? I am unable to make the changes could you please help – Zeren Mar 09 '23 at 07:21
  • @Zeren so... If you want to read it from directory use this guide - https://www.geeksforgeeks.org/how-to-read-all-excel-files-under-a-directory-as-a-pandas-dataframe/ Or you can specify some file, for example " pd.read_excel('tmp.xlsx', index_col=0) " (of course, instead of tmp use other file) – Master_Sniffer Mar 09 '23 at 19:28
  • @Zeren If it is needed, i can duplicate my comment as an another answer About A001 import pandas as pd import glob # getting excels from Dir path = "C:\Users\some_user\Desktop\" #or use here your path filenames = glob.glob(path + "\*.xlsx") print('File names:', filenames) # just to make sure that every file will be included # loop over all excel files for file in filenames: # reading excel files if ("A001" not in file): print("Reading file = ",file) print(pd.read_excel(file)) else: print(f"This file has A001 in name ", {file}) – Master_Sniffer Mar 09 '23 at 19:35