
I am trying to read 349 CSV files, all with the same columns and roughly 15 GB in total, and combine them into one dataframe. However, I keep getting MemoryError, so I have tried adding a 10-20 second sleep every 10 files. My code below manages to read them into a list of dfs, although it sometimes crashes.

import glob
import os
import time
import pandas as pd 

path = r"C:\path\*\certificates.csv"
files = []
for filename in glob.iglob(path, recursive=True):
    files.append(filename) 
    #print(filename)

dfs = []
sleep_for = 20
counter = 0
for file in files:
    counter += 1
    if counter % 10 == 0:
        # Pause every 10 files in an attempt to avoid the MemoryError
        print("\nSleeping for " + str(sleep_for) + " seconds.\nProceeding to append df " + str(counter))
        time.sleep(sleep_for)
    df = pd.read_csv(file)
    df = df[keep_cols] # A list of cols to keep - same in every file
    dfs.append(df)
    print('Appending df ' + str(counter))
df_combined = pd.concat(dfs)
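
As an aside, pd.read_csv also accepts a usecols argument, so the column filtering could happen at read time rather than after loading each full file. A minimal sketch of that variant of the loop above, assuming files, keep_cols and sleep_for are defined as in the question:

import time

import pandas as pd

# Sketch only: same loop as above, but loading just the kept columns via usecols.
# Assumes files, keep_cols and sleep_for are defined as in the question.
dfs = []
for counter, file in enumerate(files, start=1):
    if counter % 10 == 0:
        print("Sleeping for " + str(sleep_for) + " seconds before df " + str(counter))
        time.sleep(sleep_for)
    df = pd.read_csv(file, usecols=keep_cols)  # only read the columns that are kept
    dfs.append(df)
    print('Appending df ' + str(counter))

df_combined = pd.concat(dfs, ignore_index=True)

This does not shrink the final combined dataframe, but it avoids ever materialising the columns that would be dropped anyway.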

However, when I try pd.concat on the list of dfs I get a MemoryError. I tried to work around this by appending 10 dfs at a time:

lower_limit = 0
upper_limit = 10
counter = 0

while counter < len(dfs):   
    counter += 1 
    target_dfs = dfs[lower_limit:upper_limit]
    if counter % 10 == 0:
        lower_limit += 10
        upper_limit += 10
        target_dfs = dfs[lower_limit:upper_limit]
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)
    else:
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)

However, this also throws a MemoryError. Is there a more efficient way to do this, or is there something I am doing incorrectly that is throwing the MemoryError? Or maybe pandas is the wrong tool for this job?

Maverick
  • The question is more 'does Pandas have a limit', right? – Josh Friedlander Feb 19 '19 at 12:18
  • Have you tried combining the files together? Do you have enough memory to keep the whole database in memory? – norok2 Feb 19 '19 at 12:18
  • Yes @JoshFriedlander, I forgot to change the question header from something I thought was worth asking but found the answer to! – Maverick Feb 19 '19 at 12:50
  • @norok2, how can I assess this? – Maverick Feb 19 '19 at 12:50
  • @Maverick It depends on your system, but typically the task manager tells you how much memory your program is asking for, so you could investigate there by loading, say, 1 GB of data from the disk and estimating how much that would require in a pandas dataframe by looking at the memory footprint of the Python instance that is loading the data. – norok2 Feb 19 '19 at 13:58
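
Following up on norok2's comment, one rough way to estimate the memory requirement without watching the task manager is to load a single file and ask pandas for its in-memory size directly. A sketch, assuming the files and keep_cols variables from the question and that the files are roughly similar in size:

import pandas as pd

# Rough estimate only: measure one file's in-memory footprint and scale by the file count.
# Assumes files and keep_cols are defined as in the question and the files are similar in size.
sample = pd.read_csv(files[0], usecols=keep_cols)
bytes_per_file = sample.memory_usage(deep=True).sum()  # deep=True counts object/string columns
est_total_gb = bytes_per_file * len(files) / 1024 ** 3
print("Estimated memory for all %d files: %.1f GB" % (len(files), est_total_gb))

If the estimate comes out well above the available RAM, the combined dataframe is unlikely to fit in memory however the concatenation is batched.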

0 Answers