
Task at hand: convert a CSV file into a pipe-separated file, with some additional modifications to the data in each line.

I'm reading the data from a big file (~5 GB) one line at a time, performing the necessary modifications on each line, and then writing the result to the output file.

I started with naive code and have been optimizing it based on suggestions from other questions:

I started using the file object as an iterator, as suggested in "Read large text files in Python, line by line without loading it in to memory".

I started writing the data in batches, as suggested in "Speed up writing to files".
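For reference, the basic line-by-line conversion described above can be sketched with the standard `csv` module, which also handles quoted fields that contain commas. This is only a minimal sketch: `transform` and `convert` are hypothetical names, and the `strip()` call is a stand-in for the real modifications, which aren't shown here.

```python
import csv

def transform(row):
    # Hypothetical placeholder for the real per-line modifications
    return [field.strip() for field in row]

def convert(in_path, out_path):
    # newline='' lets the csv module handle line endings itself
    with open(in_path, newline='') as infile, \
         open(out_path, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        # Same writer, but with '|' as the field separator
        writer = csv.writer(outfile, delimiter='|', lineterminator='\n')
        for row in reader:
            writer.writerow(transform(row))
```

Both files are streamed, so memory use stays small regardless of the input size.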

My code looks like this right now:

import time
outFile = open('outfile.csv', 'w')

bunchsize = 1000000
bunch = []

with open("filename.csv", 'r', 567772160) as infile:
    for line in infile:
        try:
            # PERFORMING THE MODIFICATIONS IN INPUT LINE HERE
            # (omitted; `result` stands for the modified fields)
            temp = result

            # Generating the output line
            out = ''.join(temp) + '\n'

            # Writing into outfile in batches
            bunch.append(out)
            if len(bunch) == bunchsize:
                outFile.writelines(bunch)
                bunch = []

        except Exception:
            # Skip lines that fail to convert
            continue

    # Write whatever is left in the last, partial batch
    outFile.writelines(bunch)

# infile is closed by the with block; only outFile needs closing
outFile.close()

I assume this code can be optimized further by using one thread to write the output asynchronously and another thread to perform the modifications on the input lines.

I'd like to know how to add threading to this code. I've gone through a lot of threading examples but couldn't find anything related to what I'm trying to do here.

Edit: It might help to mention that while the code is running, both my CPU utilization and disk usage are below 10%, and it takes about 20 minutes to process a 5 GB file.
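To make the idea concrete, here is a rough sketch of the layout I have in mind, assuming a producer/consumer split with `queue.Queue`: the main thread reads and modifies lines, and a second thread writes finished batches. The names `modify`, `writer`, and `convert_threaded` are hypothetical, and the comma replacement is only a stand-in for the real modifications. I'm not claiming this is faster; in CPython the GIL means the two threads only overlap while one of them is blocked on I/O.

```python
import queue
import threading

SENTINEL = None  # marks the end of input for the writer thread

def modify(line):
    # Hypothetical placeholder for the real per-line modifications
    return line.rstrip('\n').replace(',', '|') + '\n'

def writer(out_path, q):
    # Consumer: write each batch as soon as it arrives
    with open(out_path, 'w') as outfile:
        while True:
            batch = q.get()
            if batch is SENTINEL:
                break
            outfile.writelines(batch)

def convert_threaded(in_path, out_path, bunchsize=100000, max_batches=4):
    # Bounded queue, so the reader cannot run far ahead of the writer
    q = queue.Queue(maxsize=max_batches)
    t = threading.Thread(target=writer, args=(out_path, q))
    t.start()
    bunch = []
    try:
        with open(in_path) as infile:
            for line in infile:
                bunch.append(modify(line))
                if len(bunch) >= bunchsize:
                    q.put(bunch)  # hand the full batch to the writer
                    bunch = []
        if bunch:
            q.put(bunch)  # last, partial batch
    finally:
        q.put(SENTINEL)  # always tell the writer to stop
        t.join()
```

It would be called as `convert_threaded('filename.csv', 'outfile.csv')`. One known gap in this sketch: if the writer thread dies early, the main thread can block on `q.put` once the queue fills up, so real code would need some error handling there.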

noobcoder
  • threading won't make the I/O bottleneck go away... and please don't use `os` as a variable name as it's the operating system package and it's used _a lot_ – Jean-François Fabre Jun 19 '17 at 08:17
  • @Jean-FrançoisFabre Since I'm writing the data in batches, wouldn't it be more efficient to proceed with the computation required for the modifications of the input buffer on a separate thread while the data is being written into the disk on another thread? – noobcoder Jun 19 '17 at 08:24
  • yes that would work, but it's difficult for us to rewrite your code fully so it does what you want. You should rework your code to try that, and if it doesn't work [edit] your question with a [mcve] to explain your problems. – Jean-François Fabre Jun 19 '17 at 08:26
  • @Jean-FrançoisFabre I've edited the code to contain the bare minimum. – noobcoder Jun 19 '17 at 08:33

0 Answers