I am trying to read 349 CSV files, all with the same columns and c. 15 GB in total, and combine them into one dataframe. However, I keep getting MemoryError, so I have tried using a 10-20 second sleep every 10 files. My code below manages to read them into a list of dfs, although sometimes it crashes.
import glob
import time
import pandas as pd

path = r"C:\path\*\certificates.csv"

files = []
for filename in glob.iglob(path, recursive=True):
    files.append(filename)
    # print(filename)

dfs = []
sleep_for = 20
counter = 0

for file in files:
    counter += 1
    if counter % 10 == 0:
        # Pause every 10 files - my attempt to ease the memory pressure
        time.sleep(sleep_for)
        print("\nSleeping for " + str(sleep_for) + " seconds.\nProceeding to append df " + str(counter))
    else:
        print('Appending df ' + str(counter))
    df = pd.read_csv(file)
    df = df[keep_cols]  # a list of cols to keep - the same in every file
    dfs.append(df)

df_combined = pd.concat(dfs)
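As an aside, since I only ever keep keep_cols, I assume I could push that filter into the read itself with read_csv's usecols argument, so the unwanted columns are never parsed at all. An untested sketch of what I mean:

# Sketch: select columns at parse time rather than after reading,
# so pandas never materialises the columns I am going to drop anyway
dfs = [pd.read_csv(file, usecols=keep_cols) for file in files]
df_combined = pd.concat(dfs, ignore_index=True)

That should shrink each individual df, although the combined frame would presumably still end up the same size.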
However, when I try pd.concat on the list of dfs I get a MemoryError. I tried to work around this by appending 10 dfs at a time:
df_combined = pd.DataFrame()  # needs to exist before the first append
lower_limit = 0
upper_limit = 10
counter = 0

while counter < len(dfs):
    counter += 1
    if counter % 10 == 0:
        # Move the window on to the next 10 dfs
        lower_limit += 10
        upper_limit += 10
    target_dfs = dfs[lower_limit:upper_limit]
    # NB: the current window is re-appended on every pass of the loop,
    # and DataFrame.append copies the whole accumulated frame each call
    # (it is deprecated in newer pandas in favour of pd.concat)
    for each_df in target_dfs:
        df_combined = df_combined.append(each_df)
However, this also throws MemoryError. Is there a more efficient way to do this, or is there something I am doing incorrectly that throws the MemoryError? Or maybe pandas is the wrong tool for this job?
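For reference, one direction I have been considering is to avoid holding all the dfs in memory at once and instead stream each filtered file straight into one CSV on disk, along these lines (untested sketch; combined.csv is just a hypothetical output path):

# Sketch: write each filtered df straight to a single on-disk CSV,
# so memory use is bounded by the largest single file, not the total
out_path = r"C:\path\combined.csv"  # hypothetical output location
for i, file in enumerate(files):
    df = pd.read_csv(file, usecols=keep_cols)
    df.to_csv(out_path, mode="w" if i == 0 else "a",
              header=(i == 0), index=False)

The idea is that only one df is ever in memory at a time, but I am not sure whether this is the idiomatic way to handle it.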