
I'm a student in Korea, and I'm using Python to analyze options data (finance). I'm looking for a better way to speed up my Python code.

The target data is the per-minute transaction records of the options, covering the period from 2015 to 2019. The data is divided into 1,227 text files (one per workday over the 5 years), so I tried to concatenate all 1,227 files into one, to minimize the number of file accesses: I will use the result (the concatenated, preprocessed file) repeatedly, and reading every separate file each time took too much time. Below is part of my code.

import pandas as pd

# file_name is a list containing the names of all 1,227 daily files, ordered by date
result_df = pd.DataFrame()
for f in file_name:

    data_opt = pd.read_csv(location + f, header=None, sep="\t")

    # do something
    # ...
    # ...

    oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day's data

    result_df = pd.concat([result_df, oneday_df], axis=0)

result_df.to_csv()

This code works and produces the correct result. However, it slows down as it runs: it processes the early files quickly but gets slower and slower on the later ones. Is there a better way to speed up my Python code?

(Sorry for my awkward English, and thank you for reading my whole question.)

asked by ts10 · edited by snakecharmerb
  • You can move the `result_df = pd.concat(...)` outside the loop by saving each `oneday_df` into a list and then doing `result_df = pd.concat(list_with_oneday_dfs, axis=0)` – Andrej Kesely Sep 27 '20 at 08:59
  • Making multiple concat calls is not ideal. Instead, you can read up on these Stack Overflow questions, which address similar concerns: https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe and https://stackoverflow.com/questions/57000903/what-is-the-fastest-and-most-efficient-way-to-append-rows-to-a-dataframe – Amit Singh Sep 27 '20 at 09:04
  • Thank you very much! I learned a new idea! I will try it right now. – ts10 Sep 27 '20 at 09:28

3 Answers


Rather than concatenating in memory, you could keep the output CSV file open and write each day's data to it as you go.

That way you'll never have more than one day's worth of data in memory at a time, improving not only speed but also memory consumption.

Something like:

with open('out_file.csv', 'w') as of:
    for i, f in enumerate(file_name):

        data_opt = pd.read_csv(location + f, header=None, sep="\t")

        # do something
        # ...
        # ...

        oneday_df = pd.concat([minute_call, minute_put], axis=0)  # result of processing one day's data

        is_first_part = (i == 0)
        oneday_df.to_csv(of, header=is_first_part)
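Two small refinements, depending on your data: pandas recommends opening the file with `newline=''` when passing an open handle to `to_csv`, and if the row index is just a default counter rather than, say, meaningful timestamps, pass `index=False` as well so the counter isn't written into the CSV.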
answered by Jiří Baum

I think that rather than concatenating inside the for loop, you should store the dataframes in a list and concatenate them once after the loop. Each `pd.concat` inside the loop copies the entire accumulated dataframe, so the total work grows quadratically with the number of files; that is why your code slows down over time. A single concat at the end copies each row only once.

Something like:

minute_something = []
for f in file_name:

    data_opt = pd.read_csv(location + f, header=None, sep="\t")

    # do something
    # ...
    # ...

    minute_something.append(minute_call)
    minute_something.append(minute_put)  # results of processing one day's data

result_df = pd.concat(minute_something, axis=0)
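If every per-day frame carries a default integer index, the combined frame will have repeated row labels. Assuming the labels themselves aren't needed, you can ask for a fresh index at the same time:

result_df = pd.concat(minute_something, axis=0, ignore_index=True)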
answered by Azuuu

I timed two approaches to combining 1,000 CSV files, each with 100 rows and 10 columns:

  • [19.50 s] collect the dataframes in a list, concatenate once, and save the result
  • [09.84 s] iteratively append to a CSV file

First I generated the test files and collected the filenames in a list:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100,10))

for i in range(1000):
    df.to_csv(f'../data/test/file{i:04d}.csv', index=False)

import glob
files = glob.glob('../data/test/file*.csv')
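One caveat when adapting this to the original date-ordered files: `glob.glob` returns names in arbitrary, OS-dependent order, so sort them explicitly (a zero-padded naming scheme like the `file{i:04d}.csv` used above sorts correctly as text):

files = sorted(glob.glob('../data/test/file*.csv'))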

Then I timed both methods.

append to list, concat all, to_csv

%%time
# read every file into a list of dataframes
l = []
for file in files:
    l.append(pd.read_csv(file))
# one concatenation at the end, then a single write
pd.concat(l).to_csv('../data/test/concat_files.csv', index=False)

Out:

CPU times: user 4.7 s, sys: 547 ms, total: 5.25 s
Wall time: 19.5 s

append mode to CSV file

%%time
# write the header once, using an empty slice of the first file
pd.read_csv(files[0])[:0].to_csv('../data/test/append_files.csv', index=False)
for file in files:
    pd.read_csv(file).to_csv('../data/test/append_files.csv', mode='a', header=False, index=False)

Out:

CPU times: user 8.09 s, sys: 1.19 s, total: 9.28 s
Wall time: 9.84 s
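The exact timings are machine- and filesystem-dependent, but the append-mode approach has a second advantage: it never holds more than one file's rows in memory at a time, which matters more and more as the inputs grow.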
answered by Michael Szczesny
  • It is a very clear comparison! Thank you for your measurements! I tried the second method and it worked well! – ts10 Sep 27 '20 at 11:21