
I'm trying to read and write data from a large file (~300 million lines, ~200 GB) with Python. I've been able to get the basic code to work, but would like to parallelize it so that it runs faster. To do so I've been following this guide: https://www.blopig.com/blog/2016/08/processing-large-files-using-python/. However, when I try to parallelize the code I get an error: "TypeError: worker() argument after * must be an iterable, not int". How can I get the code to run, and do you have any suggestions for increasing the efficiency? Please note that I'm relatively new to Python.

Basic code (where id_pct1 and id_pct001 are sets):

with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
    for line in f:
        data = line.split('*')
        if data[30] in id_pct1:
            out_f1.write(line)
        if data[30] in id_pct001:
            out_f001.write(line)

Parallel code:

import multiprocessing as mp

def worker(lineByte):
    with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
        f.seek(lineByte)
        line = f.readline()
        data = line.split('*')
        if data[30] in id_pct1: out_f1.write(line)
        if data[30] in id_pct001: out_f001.write(line)


def main():
    pool = mp.Pool()
    jobs = []

    with open('Subsets/FirstLines.txt') as f:
        nextLineByte = 0
        for line in f:
            jobs.append(pool.apply_async(worker,(nextLineByte)))
            nextLineByte += len(line)

        for job in jobs:
            job.get()

        pool.close()

if __name__ == '__main__':
    main()
  • I think you're just missing a comma after `nextLineByte`. See https://stackoverflow.com/questions/49947814/python-threading-error-must-be-an-iterable-not-int – pcarter Nov 25 '19 at 19:51

1 Answer


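Try

 jobs.append(pool.apply_async(worker,(nextLineByte,)))

The args parameter of pool.apply_async() must be an iterable, normally a tuple.

(nextLineByte) is just a parenthesized int, not a tuple, which is what the error is reporting. The trailing comma in (nextLineByte,) makes it a one-element tuple.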
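
To make the difference concrete, here is a minimal sketch that needs no pool at all. The `worker` below is a hypothetical stand-in that just returns its argument, and `next_line_byte = 42` is an arbitrary value chosen for illustration. It relies on the fact that the arguments are eventually unpacked as `func(*args)`, which is where the error message in the question comes from.

```python
def worker(line_byte):
    # Hypothetical stand-in for the real worker: echo the offset it receives.
    return line_byte


next_line_byte = 42  # arbitrary illustrative value

not_a_tuple = (next_line_byte)   # parentheses only group; this is still an int
one_tuple = (next_line_byte,)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Unpacking a tuple works: this is equivalent to worker(42).
print(worker(*one_tuple))

# Unpacking an int fails with the same kind of TypeError as in the question.
try:
    worker(*not_a_tuple)
except TypeError as exc:
    print("TypeError:", exc)
```

Passing a list, as in `pool.apply_async(worker, [nextLineByte])`, should work as well, since any iterable of positional arguments can be unpacked.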

em_bis_me
  • Thanks! I also realized I had some other problems. The code works once I also change the increment to nextLineByte += len(line)+1 and add a "listener" process, per https://stackoverflow.com/questions/22147166/parallel-excution-and-file-writing-on-python – giacomo1488 Nov 26 '19 at 14:12
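
On the `len(line)+1` adjustment: the thread does not say why it was needed, but one plausible cause (an assumption) is `\r\n` line endings. In text mode Python translates `\r\n` to `\n` by default, so `len(line)` counts characters after translation and comes out one short of the byte count that `f.seek()` needs. A sketch of a more robust alternative is to compute the offsets in binary mode, where `len(line)` is an exact byte count. The helper names `line_offsets` and `read_line_at`, the UTF-8 encoding, and the small demo file are all made up for illustration.

```python
import os
import tempfile


def line_offsets(path):
    """Return the byte offset at which each line of `path` starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:      # binary mode: len(line) is a byte count
        for line in f:
            offsets.append(position)
            position += len(line)    # includes the b"\n" or b"\r\n" terminator
    return offsets


def read_line_at(path, offset):
    """Seek to a byte offset and return the line starting there as text."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8")  # encoding is an assumption


# Small demo file with Windows-style line endings and '*' separators.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a*1\r\nbb*2\r\nccc*3\r\n")

offsets = line_offsets(demo_path)
lines = [read_line_at(demo_path, off) for off in offsets]
os.remove(demo_path)

print(offsets)
print([line.rstrip("\r\n") for line in lines])
```

Computing offsets this way also stays correct when a line contains multi-byte characters, which a character count in text mode would not.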