read_in_chunks is a function that returns some number of bytes, the chunk_size, from the file. read_in_chunks is a generator and uses the yield keyword, so these chunks are not loaded into your computer's memory until they are needed.
You say your script reads 'line by line'; technically it reads 'chunk' by 'chunk'. This distinction may seem pedantic, but it is important to note.
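For reference, a chunked reader like the one you describe usually looks something like this (just a sketch; the parameter names and default size here are placeholders, and your actual read_in_chunks may differ in the details):

def read_in_chunks(file_object, chunk_size=1024):
    # yield the file one chunk at a time; nothing is read until it is asked for
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data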
Reading a file in parallel is not going to give you any performance gains. Assuming a normal computer setup, the physical hard drive has only a single read-write head, so there is literally no way for the head to be in two places at once, reading two parts of a file.
Imagine your eyeballs trying to, at the same exact time, read from two pages at once. Not going to happen.
Reading a file is therefore an Input/Output-bound (I/O-bound) operation, and more processes cannot speed up reading the file.
However, more processes can help speed up what you do with the data you read from the file.
Right now the operation you run on the data you read from the file is print. If you were to add a multiprocessing element to your code, this is where it would occur.
Your main process would read several chunks of data. Each chunk would then be passed to a separate process, and each process would print its chunk.
Obviously print is not a CPU-intensive operation, so multiprocessing in this way is useless and would do more harm than good, considering the overhead of spawning new processes.
However, if the operation on the data were CPU-intensive, for example a complex algorithm that takes a string of text and computes its Weissman score, multiprocessing would be beneficial.
The main process would read large chunks of data and pass each chunk to a separate process. Each process would calculate the Weissman score of its chunk and then return that result to the main process.
Here is a rough sketch of what that could look like, using the multiprocessing module and the read_in_chunks generator from your question:
import multiprocessing

def calc_weissman_score(chunk_of_data):
    # a bunch of CPU-intensive stuff here that takes a lot of time
    print(42)

if __name__ == '__main__':
    gigabyte = 1000000000
    # a pool of 5 worker processes, managed by the multiprocessing module
    with multiprocessing.Pool(processes=5) as pool:
        with open('teste.txt') as f:
            # imap hands chunks to workers as they become free, so you do not
            # have to manage the "wait for a free process" step yourself
            for _ in pool.imap(calc_weissman_score,
                               read_in_chunks(f, chunk_size=gigabyte)):
                pass
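If calc_weissman_score returned the score instead of just printing, the for loop over pool.imap is where the main process would collect those results; that is the "return that info to the main process" step described above.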
In short, multiprocessing is not going to help you read data from a file, but it may speed up the time it takes you to process that data.