
I am trying to read 349 CSV files, all with the same columns and roughly 15 GB in total, and combine them into one dataframe. However, I keep getting MemoryError, so I have tried adding a 10-20 second sleep every 10 files. My code below manages to read them into a list of dfs, although it sometimes crashes.

import glob
import os
import time
import pandas as pd 

path = r"C:\path\*\certificates.csv"
files = []
for filename in glob.iglob(path, recursive=True):
    files.append(filename) 
    #print(filename)

dfs = []
sleep_for = 20
counter = 0
for file in files:
    counter += 1
    if counter % 10 == 0:
        # Pause every 10 files in an attempt to avoid the MemoryError
        print("\nSleeping for " + str(sleep_for) + " seconds.\nProceeding to append df " + str(counter))
        time.sleep(sleep_for)
    df = pd.read_csv(file)
    df = df[keep_cols] # A list of cols to keep - same in every file
    dfs.append(df)
    print('Appending df ' + str(counter))
df_combined = pd.concat(dfs)
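
As an aside, pd.read_csv also accepts a usecols argument, so the column filtering could happen at read time rather than after loading each full file. A minimal sketch of that variant of the loop above, assuming files, keep_cols and sleep_for are defined as in the question:

import time

import pandas as pd

# Sketch only: same loop as above, but loading just the kept columns via usecols.
# Assumes files, keep_cols and sleep_for are defined as in the question.
dfs = []
for counter, file in enumerate(files, start=1):
    if counter % 10 == 0:
        print("Sleeping for " + str(sleep_for) + " seconds before df " + str(counter))
        time.sleep(sleep_for)
    df = pd.read_csv(file, usecols=keep_cols)  # only read the columns that are kept
    dfs.append(df)
    print('Appending df ' + str(counter))

df_combined = pd.concat(dfs, ignore_index=True)

This does not shrink the final combined dataframe, but it avoids ever materialising the columns that would be dropped anyway.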

However, when I try pd.concat on the list of dfs I get a MemoryError. I tried to work around this by appending 10 dfs at a time:

lower_limit = 0
upper_limit = 10
counter = 0

while counter < len(dfs):   
    counter += 1 
    target_dfs = dfs[lower_limit:upper_limit]
    if counter % 10 == 0:
        lower_limit += 10
        upper_limit += 10
        target_dfs = dfs[lower_limit:upper_limit]
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)
    else:
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)

However, this also throws a MemoryError. Is there a more efficient way to do this, or is there something I am doing incorrectly that is throwing the MemoryError? Or maybe pandas is the wrong tool for this job?

Maverick
  • The question is more 'does Pandas have a limit', right? – Josh Friedlander Feb 19 '19 at 12:18
  • Have you tried combining the files together? Do you have enough memory to keep the whole database in memory? – norok2 Feb 19 '19 at 12:18
  • Yes @JoshFriedlander, I forgot to change the question header from something I thought was worth asking but found the answer to! – Maverick Feb 19 '19 at 12:50
  • @norok2, how can I assess this? – Maverick Feb 19 '19 at 12:50
  • @Maverick It depends on your system, but typically the task manager tells you how much memory your program is asking for, so you could investigate there by loading, say, 1 GB of data from the disk and estimating how much that would require in a pandas dataframe by looking at the memory footprint of the Python instance that is loading the data. – norok2 Feb 19 '19 at 13:58
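
Following up on norok2's comment, one rough way to estimate the memory requirement without watching the task manager is to load a single file and ask pandas for its in-memory size directly. A sketch, assuming the files and keep_cols variables from the question and that the files are roughly similar in size:

import pandas as pd

# Rough estimate only: measure one file's in-memory footprint and scale by the file count.
# Assumes files and keep_cols are defined as in the question and the files are similar in size.
sample = pd.read_csv(files[0], usecols=keep_cols)
bytes_per_file = sample.memory_usage(deep=True).sum()  # deep=True counts object/string columns
est_total_gb = bytes_per_file * len(files) / 1024 ** 3
print("Estimated memory for all %d files: %.1f GB" % (len(files), est_total_gb))

If the estimate comes out well above the available RAM, the combined dataframe is unlikely to fit in memory however the concatenation is batched.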

0 Answers