I have a folder that contains around 5,000 CSV files. Each CSV contains the same header information. I would like to merge these CSVs into one large DataFrame. I have implemented some code that is working, but I have a feeling it isn't the best way to accomplish this.
import os

import pandas as pd
from detect_delimiter import detect      # assuming the detect-delimiter package for detect()
from IPython.display import clear_output

def loadData(filename):
    # Sniff the delimiter from the raw text, then read only the columns I need.
    path = 'data/1611339510/' + filename
    with open(path, mode='r') as file:
        f_string = file.read()
    delimiter = detect(f_string)
    df = pd.read_csv(path, delimiter=delimiter,
                     usecols=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])
    return df
def concatDataframes():
    column_names = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume"]
    df = pd.DataFrame(columns=column_names)
    for filename in os.listdir('data/1611339510/'):
        clear_output(wait=True)
        tempDF = loadData(filename)
        tempDF["Symbol"] = filename[:-4]    # file name minus ".csv" is the ticker symbol
        tempDF = tempDF[column_names]
        print('loaded ' + filename + ' ✓')
        df = df.append(tempDF)              # append each file's rows to the running frame
    return df
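For context, I then build the combined frame with a single call, roughly like this:

    df = concatDataframes()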
This works okay, but the speed drops significantly as it progresses. It may start off joining around 500 DataFrames per second, but by the time I'm about 25% of the way through the folder it is taking several seconds per DataFrame. The folder contains about 700 MB of data in total. I am running a decent machine with 32 GB of RAM and a 10th-gen i7, and there doesn't appear to be a memory leak or anything like that.
I'm wondering whether joining these DataFrames in smaller, evenly sized chunks first might help?
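This is roughly what I have in mind, as a sketch only: it reuses the loadData helper above, and the function name concatDataframesChunked and the chunk_size value are just placeholders I picked for illustration. Instead of appending to one ever-growing DataFrame, it collects per-file DataFrames in a list and concatenates them in batches with pd.concat:

    def concatDataframesChunked(chunk_size=500):
        # Sketch only: chunk_size and this function name are my own invention.
        column_names = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume"]
        chunks = []      # per-file DataFrames waiting to be concatenated
        batches = []     # concatenated batches of up to chunk_size files each

        for filename in os.listdir('data/1611339510/'):
            tempDF = loadData(filename)          # same helper as above
            tempDF["Symbol"] = filename[:-4]
            chunks.append(tempDF[column_names])

            # Concatenate a full batch in one go instead of growing df row block by row block.
            if len(chunks) >= chunk_size:
                batches.append(pd.concat(chunks, ignore_index=True))
                chunks = []

        if chunks:                               # leftover partial batch
            batches.append(pd.concat(chunks, ignore_index=True))

        # One final concat over ~10 batches instead of ~5,000 appends.
        return pd.concat(batches, ignore_index=True)

Even without the batching, I suspect that collecting every per-file DataFrame into a single list and calling pd.concat once at the end would avoid the repeated copying that my append-in-a-loop version does.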