
I have a folder of 100 GB of CSV files that I want to merge into a single CSV. The file names are in order of row position. I've written a single-threaded script to tackle this, but it is understandably slow.

def JoinRows(rows_to_join, init=True):
    #rows_to_join is a list of csv paths.
    for i, row in enumerate(rows_to_join):
        with open('join_rows.csv', 'a') as f1:
            #join_rows.csv is just the output file with all the rows
            with open(row, 'r') as f2:
                for line in f2:
                    f1.write('\n'+line)

I also wrote a recursive function that doesn't work and isn't parallel (yet). My thought was to join each CSV with another, delete the second of the two, and keep repeating until only one file was left; that way the task could be split across the available threads. Any suggestions?

import subprocess

def JoinRows(rows_to_join, init=False):
    if init==True: rows_to_join.sort()
    LEN = len(rows_to_join)
    print(LEN)
    if len(rows_to_join) == 2:
        with open(rows_to_join[0], 'a') as f1:
            with open(rows_to_join[1], 'r') as f2:
                for line in f2:
                    f1.write('\n'+line)
        subprocess.check_call(['rm '+rows_to_join[1]], shell=True)
        return(rows_to_join[1])
    else:
        rows_to_join.remove(JoinRows(rows_to_join[:LEN//2]))
        rows_to_join.remove(JoinRows(rows_to_join[LEN//2:]))
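
For reference, a minimal corrected sketch of that pairwise-merge idea (still sequential, and assuming every input already ends with a newline and has no header row; the `join_rows` name is just illustrative) might look like this:

    import os
    import shutil

    def join_rows(paths):
        # Merge every file in `paths` into paths[0] and return that path.
        if len(paths) == 1:
            return paths[0]
        mid = len(paths) // 2
        left = join_rows(paths[:mid])     # survivor of the first half
        right = join_rows(paths[mid:])    # survivor of the second half
        with open(left, 'a') as dst, open(right, 'r') as src:
            shutil.copyfileobj(src, dst)  # append the right survivor onto the left
        os.remove(right)                  # the merged-in file is no longer needed
        return left

Note that every level of the recursion re-reads and re-writes data that lower levels already wrote, so even if it were parallelized it does strictly more disk work than a single sequential pass (see the comments below).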
Joe B
  • Your bottleneck is writing to disk and parallel processing doesn't change that. Don't waste your time complicating it that way. – Steven Rumbalski Jan 18 '19 at 23:03
  • It may slightly help to move the `open`ing of the output file out of the `for` loop so that it is opened only once and closed after all files were processed. Additionally you may not have to read the input files line by line but read each as a whole (if not too large) and write it out as a whole block of data (maybe with an additional newline); see the sketch after these comments. – Michael Butscher Jan 18 '19 at 23:05
  • If you still want to do it effectively, use pandas: [helpful link](https://stackoverflow.com/questions/36526282/append-multiple-pandas-data-frames-at-once) – sahasrara62 Jan 18 '19 at 23:09
  • In these two lines, `for line in f2: f1.write('\n'+line)`, why are you prepending a newline in front of each line? Do you want an extra blank line between each row? Are your files well-formed in that they all end in a newline? Assuming that they are, and that the extra newline is a bug, don't loop the write, just do `f1.write(f2.read())`. I notice that you aren't skipping a header row. Do your files lack header rows? – Steven Rumbalski Jan 18 '19 at 23:16
  • @prashantrana: Sometimes `pandas` is not the answer. Here it's definitely not. Did you notice that there are 100gb of files? And you want them all parsed into DataFrames and concatenated in memory? Even if memory wasn't an issue, why deal with parsing files when all you want to do is concatenate them? It's not going to be faster. This is solvable merely as a text processing problem and pure Python will excel at it. – Steven Rumbalski Jan 18 '19 at 23:21
  • @StevenRumbalski that makes perfect sense since there is no other text processing going on. It made me realize doing it recursively would slow things down exponentially since you'd actually be parsing and writing the same data many times. – Joe B Jan 18 '19 at 23:53
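
Putting the comment suggestions together (open the output once, copy each input as large blocks rather than line by line), a rough sketch of the plain sequential concatenation could look like the following; it again assumes the inputs have no header rows and each already ends with a newline, and the `concat_csvs` name is just a placeholder:

    import shutil

    def concat_csvs(paths, out_path='join_rows.csv'):
        # Concatenate the already-sorted input files into one output file.
        with open(out_path, 'wb') as dst:            # the output is opened exactly once
            for path in paths:
                with open(path, 'rb') as src:
                    shutil.copyfileobj(src, dst)     # block copy instead of line-by-line writes

Called as `concat_csvs(sorted(rows_to_join))`, this leaves the disk as the only bottleneck, which matches the advice above.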

0 Answers