
Thank you for any help

I have a list of paths to Excel files, 622 of them:

excel_file_paths_list = ['path1', 'path2', ...]

I found ways to concatenate them AFTER generating the DataFrames, and I also know that concat is faster than append:

import pandas as pd

df_from_each_file = (pd.read_excel(f) for f in excel_file_paths_list if f.endswith('.xlsx'))
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

However, because I have 622 Excel files, it takes forever. Any ideas on how to do it faster?

1 Answer


You can try multiprocessing to speed up the reading of all the files. Something like this:

from multiprocessing import Pool, cpu_count

import pandas as pd

def read_excel(path):
    return pd.read_excel(path)

if __name__ == "__main__":
    # get list of excel files to parse
    excel_file_paths_list = ['path1', 'path2', ...]
    files = [f for f in excel_file_paths_list if f.endswith('.xlsx')]

    # multiprocess the reading of the excel files (returns a list of DataFrames)
    with Pool(cpu_count()) as p:
        excel_dfs = p.map(read_excel, files)

    # concat all dfs read from the excel files
    concatenated_df = pd.concat(excel_dfs, ignore_index=True)
  • Thank you for your contribution. However, I tried with 44 Excel files: the method without multiprocessing took 1 min 44 sec, while the multiprocessing one had to be interrupted after 3 minutes with no result. – data_runner Apr 01 '22 at 12:01
    How much of a speed-up are you looking for? In the link I shared above, someone proposes converting xlsx -> csv first so as to speed up reading by about 2x, would that be enough for you? – Learning is a mess Apr 05 '22 at 09:25
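
As a rough sketch of the xlsx -> csv idea from the comment above (not part of the original answer): convert each workbook once to CSV, then read the CSVs, which pandas parses much faster than xlsx. The csv_cache folder and the to_csv_once helper are illustrative names, not anything from the question.

import os
import pandas as pd

CSV_CACHE = "csv_cache"  # hypothetical folder holding the converted files
os.makedirs(CSV_CACHE, exist_ok=True)

def to_csv_once(xlsx_path):
    # convert one workbook's first sheet to CSV, skipping files already converted
    csv_name = os.path.splitext(os.path.basename(xlsx_path))[0] + ".csv"
    csv_path = os.path.join(CSV_CACHE, csv_name)
    if not os.path.exists(csv_path):
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    return csv_path

excel_file_paths_list = ['path1', 'path2']  # placeholder paths
csv_paths = [to_csv_once(f) for f in excel_file_paths_list if f.endswith('.xlsx')]
concatenated_df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

The conversion itself still pays the xlsx parsing cost once, so the gain shows up mainly when the same files are read repeatedly across runs.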