The main bottleneck you need to be aware of here is neither disk nor CPU. It is memory: not how much you have, but how the hardware and OS work together to pre-fetch pages from RAM into the L1/L2/L3 caches.
The parallel approach will perform worse than the serial approach if you use readline() to load one line at a time. The reason has to do with hardware and the OS, not your software. When the CPU performs a memory read, a certain amount of adjacent data is fetched into the CPU's Lx caches in anticipation that it will be needed soon. With the serial approach, that extra data is in fact used while it is still in the cache. But when parallel readline() calls are happening across processes, the extra data is evicted before there is a chance to use it, so it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the serial readline() approach is to have your parallel processes read more than one line at a time into memory: use read() instead of readline(). How many bytes should you read? Roughly the size of your L1 cache, on the order of 64 KB. That way several pages of contiguous memory are loaded into the cache per call.
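In code, the change described above is simply this (a sketch; the file name is illustrative, and handling of lines that straddle chunk boundaries is omitted):

```python
CHUNK = 64 * 1024                    # roughly L1-sized; tune for your hardware

with open('input.dat', 'rb') as f:   # hypothetical input file
    while True:
        data = f.read(CHUNK)         # one cache-sized block instead of f.readline()
        if not data:
            break
        # ... split `data` into lines and process them here ...
```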
However, if you replace serial readline() with serial read(), the serial version will again outperform parallel read(). Why? Because although each core has its own L1 cache, the cores share the higher cache levels and the memory bus, so you run into the same problem: Process 1 pre-fetches data into the shared caches, but before it has time to process it all, Process 2 is scheduled and replaces the contents of the cache. Process 1 therefore has to fetch the same data again later.
The performance difference between serial and parallel can be easily seen:
First make sure the whole file is in the page cache (check the buff/cache column of free -m). This removes the disk/block-device layer from the equation entirely: the whole file is served from RAM.
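A quick way to warm the page cache from Python (the file name is illustrative) is to read the file once and discard the data; free -m should then show the file reflected under buff/cache:

```python
# Pull the whole file into the OS page cache so later timings never touch disk.
with open('input.dat', 'rb') as f:   # hypothetical input file
    while f.read(1 << 20):           # read and discard 1 MB at a time
        pass
```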
Then run a simple serial reader and time it.
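A minimal serial harness might look like this (a sketch; the byte count stands in for real per-chunk work):

```python
import time

def serial_read(path, chunk=64 * 1024):
    total = 0
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)       # stand-in for real work on `data`
    return total

t0 = time.perf_counter()
serial_read('input.dat')             # hypothetical input file
print('serial: %.3fs' % (time.perf_counter() - t0))
```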
Then run a simple parallel reader using multiprocessing.Process() and time it. On commodity systems, the serial reads will outperform the parallel reads.
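And a parallel counterpart, splitting the file into one byte range per core (again a sketch with an illustrative file name; records straddling slice boundaries are ignored):

```python
import multiprocessing as mp
import os
import time

def slice_read(path, start, length, chunk=64 * 1024):
    """Read one contiguous byte range of the file in cache-sized chunks."""
    with open(path, 'rb') as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:
                break
            remaining -= len(data)
            # ... process `data` here ...

if __name__ == '__main__':
    path = 'input.dat'               # hypothetical input file
    size = os.path.getsize(path)
    n = mp.cpu_count()
    step = size // n
    t0 = time.perf_counter()
    procs = [mp.Process(target=slice_read,
                        args=(path, i * step,
                              size - i * step if i == n - 1 else step))
             for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('parallel: %.3fs' % (time.perf_counter() - t0))
```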
This is contrary to what you'd expect, right? But think about the assumptions you were making: they didn't include the fact that caches and memory buses are shared between cores, and that is precisely where the bottleneck is located. We know disk isn't the bottleneck because the entire file is in the page cache in RAM. We know CPU isn't the bottleneck because multiprocessing.Process() gives us simultaneous execution, one process per core (verify with vmstat 1, or top -d 1 and press 1 for the per-core view).
The only way to get better performance from the parallel approach is to run on hardware with separate NUMA nodes, where each node (group of cores) has its own memory bus and its own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, putting 100% utilization on every core at the same time. This is done using multiprocessing.Process().
Thread pools will not work for this: Python threads are serialized by the GIL, so only one thread executes Python bytecode at a time, confining the pool to a single core's worth of computation. But multiprocessing.Process() is not restricted by the GIL and certainly does work.
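A quick way to see this for yourself (a pure-Python busy loop; the iteration count is arbitrary) is to start one Process per core and watch every core go to 100% in top:

```python
import multiprocessing as mp

def burn(n):
    """Pure-Python arithmetic loop; pins one core at ~100% while it runs."""
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    procs = [mp.Process(target=burn, args=(50_000_000,))
             for _ in range(mp.cpu_count())]
    for p in procs:
        p.start()                # watch per-core usage: run top, then press 1
    for p in procs:
        p.join()
```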
Another side note: it is bad practice to try to load your entire input file onto the heap. You never know whether the file is larger than available RAM, and if it is, you'll trigger an out-of-memory kill. So don't do it in an attempt to optimize the performance of your code.
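If you do need the whole file addressable at once, one alternative (not part of the discussion above, just a common pattern) is mmap, which maps the file through the page cache so the OS pages data in and out on demand instead of copying it all onto the heap:

```python
import mmap

with open('input.dat', 'rb') as f:   # hypothetical input file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:4096]           # slices are paged in lazily via the page cache
```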