
Thank you for any help

I have a list of paths to Excel files, 622 of them:

excel_file_paths_list = ['path1', 'path2', ...]

I found ways to concatenate them AFTER generating the DataFrames, and I also know that concat is faster than append:

import pandas as pd

df_from_each_file = (pd.read_excel(f) for f in excel_file_paths_list if f.endswith('.xlsx'))
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

However, because I have 622 Excel files, it takes forever. Any ideas on how to do it faster?

1 Answer


You can try multiprocessing to speed up the reading of all the files. Something like this:

from multiprocessing import Pool, cpu_count

import pandas as pd

def read_excel(path):
    return pd.read_excel(path)

if __name__ == "__main__":
    # get list of excel files to parse
    excel_file_paths_list = ['path1', 'path2', ...]
    files = [f for f in excel_file_paths_list if f.endswith('.xlsx')]

    # multiprocess the reading of the excel files (returns a list of DataFrames)
    with Pool(cpu_count()) as p:
        excel_dfs = p.map(read_excel, files)

    # concat all dfs read from the excel files
    concatenated_df = pd.concat(excel_dfs, ignore_index=True)
  • Thank you for your contribution. However, I tried with 44 Excel files: the method without multiprocessing took 1 min 44 sec, while the multiprocessing one had to be interrupted after 3 minutes with no result. – data_runner Apr 01 '22 at 12:01
    How much of a speed-up are you looking for? In the link I shared above, someone proposes converting xlsx -> csv first so as to speed up reading by about 2x, would that be enough for you? – Learning is a mess Apr 05 '22 at 09:25
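
As a rough sketch of the xlsx -> csv idea from the comment above (not part of the original answer): convert each workbook once to CSV, then read the CSVs, which pandas parses much faster than xlsx. The csv_cache folder and the to_csv_once helper are illustrative names, not anything from the question.

import os
import pandas as pd

CSV_CACHE = "csv_cache"  # hypothetical folder holding the converted files
os.makedirs(CSV_CACHE, exist_ok=True)

def to_csv_once(xlsx_path):
    # convert one workbook's first sheet to CSV, skipping files already converted
    csv_name = os.path.splitext(os.path.basename(xlsx_path))[0] + ".csv"
    csv_path = os.path.join(CSV_CACHE, csv_name)
    if not os.path.exists(csv_path):
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    return csv_path

excel_file_paths_list = ['path1', 'path2']  # placeholder paths
csv_paths = [to_csv_once(f) for f in excel_file_paths_list if f.endswith('.xlsx')]
concatenated_df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

The conversion itself still pays the xlsx parsing cost once, so the gain shows up mainly when the same files are read repeatedly across runs.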