3

I have created a class which loops through a file and after checking if a line is valid, it'll write that line to another file. Every line it checks is a lengthy process making it very slow. I need to implement either threading/multiprocessing at the process_file function; I do not know which library is best suited for speeding this function up or how to implement it.

class FileProcessor:
    def process_file(self):
        with open('file.txt', 'r') as f:
            with open('outfile.txt', 'w') as output:
                for line in f:
                    # There's some string manipulation code here...
                    validate = FileProcessor.do_stuff(self, line)
                    # If true write line to output.txt
    def do_stuff(self, line)
        # Does stuff...
        pass 

Extra Information: The code goes through a proxy list checking whether it is online. This is a lengthy and time consuming process.

Thank you for any insight or help!

Taylor
  • 73
  • 1
  • 6
  • You could use [mmap](https://docs.python.org/2/library/mmap.html) to read the file more efficiently, do all your validation, and then do your writes to the output file. Have you tried that? – Maurice Reeves Jun 05 '16 at 14:38
  • Is the question title misleading? I think you're looking for ways to parallelize file processing because IO is faster than computation in your case (did you confirm this btw?). – Jasper Jun 05 '16 at 14:41
  • Sorry, I guess I didn't explain myself well enough. My problem is that the code looks at one line of code, does some stuff with that line and puts the line in a function which returns true or false, and then repeats with the next line, so forth. I'm trying to get it so it goes through multiple lines at once. – Taylor Jun 05 '16 at 14:49

2 Answers2

1

The code goes through a proxy list checking whether it is online

It sounds like what takes a long time is connecting to the internet, meaning your task is IO bound and thus threads can help speed it up. Multiple processes are always applicable but can be harder to use.

Alex Hall
  • 34,833
  • 5
  • 57
  • 89
  • Yes, it is connecting to the internet which is taking all of the time. I do not how to implement a thread so that it goes through through multiple lines at once, checking whether that proxy is available. – Taylor Jun 05 '16 at 14:43
  • @Taylor you have to start up several threads that all run the same code. You should probably avoid having them all access the same files. – Alex Hall Jun 05 '16 at 14:48
  • I need the threads to access the same files. Wouldn't it be accessing it one after another anyway? – Taylor Jun 05 '16 at 15:02
  • Unless the files are very large you can simply read them into a list and store the results in a list which is only written to a file after the threads are done. Otherwise use locks. – Alex Hall Jun 05 '16 at 15:03
0

This seems like a job for multiprocessing.map.

import multiprocessing

def process_file(filename):
    pool = multiprocessing.Pool(4)
    with open(filename) as fd:
        results = pool.imap_unordered(do_stuff, (line for line in fd))
        with open("output.txt", "w") as fd:
            for r in results:
                fd.write(r)

def do_stuff(item):
    return "I did something with %s\n" % item

process_file(__file__)

You can also use multiprocessing.dummy.Pool instead if you want to use threads (which might be preferable in this case since your are I/O bound).

Essentially you are passing an iterable to imap_unordered (or imap if order matters) and farming out portions of it to other processes (or threads if using dummy). You can tune the chunksize of the map to help with efficiency.

If you want to encapsulate this into a class, you'll need to use multiprocessing.dummy. (Otherwise it can't pickle the instance method.)

You do have to wait until the map finishes before you can process the results, although you could write the results in do_stuff instead -- just be sure to open the file in append mode, and you'll likely want to lock the file.

Community
  • 1
  • 1
rrauenza
  • 6,285
  • 4
  • 32
  • 57