Task at hand: convert a CSV file into a pipe-separated file, with some other modifications to the data on each line.
I'm reading the data from a big file (~5 GB) one line at a time, performing the necessary modifications on the line, and finally writing the result to the output file.
I started with a straightforward version and have been optimizing it based on suggestions from other posts:
Started using the file object as the iterator, as suggested here: Read large text files in Python, line by line without loading it in to memory
Started writing the data in batches, as suggested here: Speed up writing to files
My code looks like this right now:
import time

outFile = open('outfile.csv', 'w')
bunchsize = 1000000
bunch = []

with open("filename.csv", 'r', 567772160) as infile:
    for line in infile:
        try:
            # PERFORMING THE MODIFICATIONS ON THE INPUT LINE HERE
            temp = result
            # Generating the output line
            out = ''.join(temp) + '\n'
            # Buffering the output line so it can be written in a batch
            bunch.append(out)
            if len(bunch) == bunchsize:
                outFile.writelines(bunch)
                bunch = []
        except:
            continue

# Flushing whatever is left in the last, partially filled batch
outFile.writelines(bunch)
infile.close()
outFile.close()
I assume this code can be optimized further by using a separate thread for the file writing, so that the writes happen asynchronously, while another thread performs the modifications on the input lines.
I'd like to know how to add threading to this code. I've gone through a lot of threading examples but couldn't find anything that matches what I'm trying to do here.
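Something like the following is what I have in mind: a minimal producer/consumer sketch using threading and queue.Queue, where the main thread reads and modifies lines and a writer thread drains batches to disk. The modify_line() function here is just a hypothetical placeholder for my actual per-line changes, and I'm not sure this is the right structure.

import queue
import threading

def modify_line(line):
    # Hypothetical stand-in for the real per-line modifications
    return line.replace(',', '|')

def writer(out_path, q):
    # Consumer: keep writing batches until the sentinel (None) arrives
    with open(out_path, 'w') as out_file:
        while True:
            batch = q.get()
            if batch is None:
                break
            out_file.writelines(batch)

q = queue.Queue(maxsize=8)   # bounded queue so buffered batches don't grow unbounded
writer_thread = threading.Thread(target=writer, args=('outfile.csv', q))
writer_thread.start()

bunchsize = 1000000
bunch = []
with open('filename.csv', 'r') as infile:
    for line in infile:
        bunch.append(modify_line(line))
        if len(bunch) == bunchsize:
            q.put(bunch)
            bunch = []
if bunch:
    q.put(bunch)
q.put(None)        # signal the writer thread to finish
writer_thread.join()

Is something like this the right way to structure it, or is there a better pattern for overlapping the line processing with the file writes?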
Edit: It might help to mention that while the code is running, both my CPU utilization and disk usage stay below 10%, and it takes ~20 minutes to finish processing a 5 GB file.