
I have a folder that contains around 5,000 CSV files. Each CSV contains the same header information. I would like to merge these CSVs into one large DataFrame. I have implemented some code that is working, but I have a feeling it isn't the best way to accomplish this.

import os

import pandas as pd
from IPython.display import clear_output       # clears the notebook output between files
from detect_delimiter import detect            # assumption: detect() comes from the detect_delimiter package


def loadData(filename):
    path = 'data/1611339510/' + filename

    # read the raw text once so the delimiter can be sniffed
    with open(path, mode='r') as file:
        f_string = file.read()

    delimiter = detect(f_string)

    df = pd.read_csv(path, delimiter=delimiter, usecols=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])

    return df


def concatDataframes():
    column_names = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume"]
    df = pd.DataFrame(columns=column_names)

    for filename in os.listdir('data/1611339510/'):
        clear_output(wait=True)
        tempDF = loadData(filename)
        tempDF["Symbol"] = filename[:-4]   # file name without ".csv" is the ticker symbol
        tempDF = tempDF[column_names]
        print('loaded ' + filename + ' ✓')
        df = df.append(tempDF)

    return df

This works okay, but the speed drops significantly as it progresses. It may start off joining 500 DataFrames per second, but by the time I'm about 25% of the way through the folder it is taking several seconds per DataFrame. The folder contains about 700 MB of data in total. I am running a decent machine with 32 GB of RAM and a 10th-gen i7, and there doesn't appear to be a memory leak.

I'm wondering if joining these DataFrames in smaller chunks first, pairwise, and then combining the chunks might help? A rough sketch of what I mean is below.
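Something like this, as a sketch only (pairwiseConcat is a made-up helper that would take a list of the per-file DataFrames):

def pairwiseConcat(frames):
    # repeatedly concatenate neighbouring pairs of DataFrames until one remains
    while len(frames) > 1:
        frames = [pd.concat(frames[i:i + 2], ignore_index=True)
                  for i in range(0, len(frames), 2)]
    return frames[0]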

    Append to a **list**, concatenate at the end of the loop. DataFrame.append needlessly copies all the data with every append, which is fine for 1-2 appends, but obviously really bad for 5,000+ appends. – ALollz Jan 22 '21 at 21:16
  • 2
    @ALollz This worked perfectly. Loaded the entire dataset in less than one minute. Thanks again! – Kyle Dixon Jan 22 '21 at 21:25
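For reference, a minimal sketch of the approach ALollz describes, applied to the question's concatDataframes: collect each file's DataFrame in a plain Python list and call pd.concat once at the end, so each row is copied only once rather than on every iteration.

def concatDataframes():
    column_names = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume"]
    frames = []   # accumulate per-file DataFrames here instead of appending to a growing DataFrame

    for filename in os.listdir('data/1611339510/'):
        tempDF = loadData(filename)
        tempDF["Symbol"] = filename[:-4]
        frames.append(tempDF[column_names])

    # single concatenation at the end
    return pd.concat(frames, ignore_index=True)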
