
In the code below I'm merging all CSV files whose names start with a certain date, held in the variable file_date. The code works fine for small and moderately sized CSV files but crashes with very large ones.

path = '/Users/Documents/' + file_date + '*' + '-details.csv' + '*'
allFiles = glob.glob(path)
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    frame = pd.read_csv(file_, index_col=None, header=0)
    print frame.shape
    list_.append(frame)
    df = pd.concat(list_)
    print df.shape

df.to_csv('/Users/Documents/' + file_date + '-details.csv', sep=',', index=False)

Can I process each file in chunks? If yes, how do I do that?

iprof0214
  • you might want to not use pandas since `read_csv` loads everything into memory, use the csv module or regular python and just go through the files line by line – Primusa Sep 26 '18 at 19:15
  • `df = pd.concat(list_)` should be outside of the loop. But it might be that your indentation is wrong. Can you fix it? – Anton vBR Sep 26 '18 at 19:29

2 Answers


Good question, sir! Python supports the concept of 'generators' to execute tasks in an iterator-like fashion. This is often used to partition work, such as reading a file chunk by chunk. In your case you would not only read each file this way, but also concatenate it with the others (read to the end of the first file, then append the next, step by step). See this answer on how to use a generator in this context:

Lazy Method for Reading Big File in Python?
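Building on that linked answer, here is a minimal self-contained sketch of such a generator; the chunk size and the in-memory demo file are illustrative, not part of the original question:

```python
import io

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    """Lazily read a file piece by piece instead of all at once."""
    while True:
        data = file_object.read(chunk_size)
        if not data:          # empty string means end of file
            break
        yield data

# Demo on an in-memory file so the sketch is self-contained:
demo = io.StringIO("a,b\n1,2\n3,4\n")
chunks = list(read_in_chunks(demo, chunk_size=5))
assert "".join(chunks) == "a,b\n1,2\n3,4\n"
```

Because the generator yields one chunk at a time, only `chunk_size` bytes of each file need to be in memory at once.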

Peter Branforn

If you don't need to process the files, you don't even need pandas. Just read the files line by line and write them to the new file:

with open('outfile.csv', 'w') as outfile:
    for i, filename in enumerate(all_files):
        with open(filename, 'r') as infile:
            for rownum, line in enumerate(infile):
                if (i != 0) and (rownum == 0):    # Only write header once
                    continue
                outfile.write(line)    # line already ends with '\n'
Andy