
I made a script that reads a file line by line, but I have a large file (32 GB), so it will take a long time to complete.

This is where multiprocessing comes in to make this faster, but I don't understand this read_in_chunks function very well. Can someone help me?

Here's the script:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('teste.txt')
for piece in read_in_chunks(f):
    print piece

Thanks for all.

UPDATE: Sorry, I forgot to say that I will insert each line into a MySQL DB.


2 Answers


read_in_chunks is a generator function that yields chunk_size bytes of the file at a time. Because it uses the yield keyword, each chunk is only read into your computer's memory when it is needed. You say your script reads 'line by line'; technically it reads 'chunk' by 'chunk'. This distinction may seem pedantic, but it is important to note.
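A quick way to see that lazy behaviour (a minimal sketch; 'example.txt' is just a hypothetical small file):

f = open('example.txt')
chunks = read_in_chunks(f, chunk_size=4)
print chunks        # <generator object read_in_chunks at 0x...> -- nothing has been read yet
print next(chunks)  # the first 4 bytes are read only now
print next(chunks)  # the next 4 bytes
f.close()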

Reading a file in parallel is not going to give you any performance gains. Assuming a normal computer setup, the physical hard drive has only a single read/write head, so there is literally no way for the head to be in two places at once, reading two parts of a file. Imagine your eyeballs trying to read from two pages at the same exact time. Not going to happen. Reading a file is therefore Input/Output bound (I/O bound), and more processes cannot speed up reading the file.

However, more processes can help speed up what you do with the data you read from the file.

Right now, the operation you run on the data you read from the file is print. If you were to add a multiprocessing element to your code, this is where it would occur: your main process would read several chunks of data, each chunk would be passed to a separate process, and each process would then print its chunk. Obviously print is not a CPU-intensive operation, so multiprocessing in this way is useless and would do more harm than good, considering the overhead of spawning new processes.

However, if the operation on the data were CPU-intensive (for example, a complex algorithm that takes a string of text and computes its Weissman Score), multiprocessing would be beneficial.

The main process would read large chunks of data and pass each chunk to a separate process. Each process would calculate the Weissman Score of its chunk and then return that result to the main process.

Here is a sketch of how that could look, using a Pool from the multiprocessing module:

import multiprocessing

def calc_weissman_score(chunk_of_data):
    # a bunch of CPU-intensive work here that takes a lot of time
    return 42

if __name__ == '__main__':
    gigabyte = 1000000000
    pool = multiprocessing.Pool(processes=5)  # 5 worker processes
    with open('teste.txt') as f:
        # imap hands each chunk to the next free worker and yields the
        # results back to the main process as they complete
        for score in pool.imap(calc_weissman_score,
                               read_in_chunks(f, chunk_size=gigabyte)):
            print score
    pool.close()
    pool.join()
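A note on the design: Pool.imap is used (rather than Pool.map) because it consumes the generator lazily instead of turning it into a list first, so the whole 32 GB file is never held in memory at once. Each chunk still has to be pickled and sent to a worker process, so you may want to experiment with smaller chunk sizes.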

In short, multiprocessing is not going to help you read data from a file, but it may speed up the time it takes you to process that data.

– steve
    Good answer. Just wanted to add that the [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) module has a [`Pool`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool) class which makes managing parallel tasks like this quite easy. – JKillian May 09 '15 at 14:56

Your read_in_chunks function just gives you a generator object that reads a file chunk by chunk. There's nothing going on in parallel and you're not going to see any speedup.

In fact, reading a file in parallel isn't likely to give you any speedup. Think about it at a very basic hardware level: you can only read data from one location on a hard drive at any given moment. Sequentially reading a file is going to be just as fast as any parallel attempt.

I think this answer gives a good overall picture of working with large files that will help you.

– JKillian