I'm new to pandas and making some progress with self-learning, so I want to know the best and most efficient way to handle this:

I have 3 (sometimes more than 3) Excel ".xlsx" files, each about 100 MB with at least 800K records and 200 columns per file.

The files share exactly the same columns; they are split only because they were exported from a system that cannot handle them all combined.

I want to load the files into one dataframe, opening each one and then concatenating or appending. I know it will depend on the memory of the machine, but I am looking for the best way to handle those files and manage them in a single frame.

This is what I have:

import glob
import timeit

import pandas as pd

start = timeit.default_timer()

# Read every workbook in ./data and append it row-wise to one DataFrame
all_data = pd.DataFrame()
for f in glob.glob("./data/*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

all_data

stop = timeit.default_timer()
execution_time = stop - start

print(execution_time)

With `append` it took about 7 minutes to load the files into the `all_data` dataframe.

Is there a better way to load them in less time?

MTALY
  • Load the files and write them into SQL? More info: https://stackoverflow.com/questions/28766133/faster-way-to-read-excel-files-to-pandas-dataframe – BENY May 28 '21 at 01:31
  • 2
    Reading each dataframe into a list of dataframes then calling `conact` once on the whole list of frames rather than sequentially appending would improve the performance significantly. – Henry Ecker May 28 '21 at 01:39
  • @HenryEcker I saved one of the files as `.csv`; the size almost doubled, but it was faster than `.xlsx`. Is there a specific function to convert xlsx to csv before reading it into a dataframe, and what is the possibility of data loss? Thanks – MTALY May 28 '21 at 02:33
  • I'm not the person to ask about that; I don't work with spreadsheets much. I do do a fair amount of pandas benchmarking, which is why I knew that the performance of `append` in a loop is significantly worse than a single `concat` on a list. But in terms of getting your data into a more usable format or converting from xlsx to csv, that's not my area. – Henry Ecker May 28 '21 at 02:53
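
A minimal sketch of that single-`concat` approach, reusing the `./data/*.xlsx` glob and the timing setup from the question (file sizes and layout as described above); the Excel parsing itself still dominates the runtime, but this removes the cost of re-growing the DataFrame on every iteration:

import glob
import timeit

import pandas as pd

start = timeit.default_timer()

# Read each workbook into a list, then concatenate once; building the list
# first avoids re-copying the growing DataFrame on every iteration.
frames = [pd.read_excel(f) for f in glob.glob("./data/*.xlsx")]
all_data = pd.concat(frames, ignore_index=True)

print(timeit.default_timer() - start)

For further gains, the files can be read in parallel, as shown in the answer below.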

1 Answer


You can use multiprocessing to speed up the loading, and `concat` to merge all the dataframes:

import pandas as pd
import multiprocessing
import glob
import time


def read_excel(filename):
    # Top-level function so it can be pickled and sent to worker processes
    return pd.read_excel(filename)


if __name__ == "__main__":
    files = glob.glob("./data/*.xlsx")

    # Baseline: read the files one after another in the main process
    print("Sequential")
    print(f"Loading excel files: {time.strftime('%H:%M:%S', time.localtime())}")
    start = time.time()
    data = [read_excel(filename) for filename in files]
    end = time.time()
    print(f"Loaded excel files in {time.strftime('%H:%M:%S', time.gmtime(end-start))}")
    df_sq = pd.concat(data).reset_index(drop=True)

    # Parallel: one worker per CPU core, each worker reads one file at a time
    print("Multiprocessing")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        print(f"Loading excel files: {time.strftime('%H:%M:%S', time.localtime())}")
        start = time.time()
        data = pool.map(read_excel, files)
        end = time.time()
        print(f"Loaded excel files in {time.strftime('%H:%M:%S', time.gmtime(end-start))}")
        df_mp = pd.concat(data).reset_index(drop=True)

Example: 50 files of 25 MB each (about a 2x gain)

Sequential
Loading excel files: 09:12:17
Loaded excel files in 00:00:14
Multiprocessing
Loading excel files: 09:12:33
Loaded excel files in 00:00:07
Corralien
  • `Sequential Loaded excel files in 00:03:32 Multiprocessing Loaded excel files in 00:01:31` – MTALY May 28 '21 at 16:11