I'm writing a Python script in which I read a big (~5 GB) file line by line, make some modifications to each line, and then write the result to another file.
When I use `file.readlines()` to read the input file, my disk usage reaches ~90% and the disk speed reaches 100+ Mbps (I know this method shouldn't be used for large files).
I haven't measured the program's execution time for that case because my system becomes unresponsive (the memory fills up).
When I iterate over the file object instead, like below (and this is what I'm actually using in my code):
```python
with open('file.csv', 'r') as inFile:
    for line in inFile:
        ...  # modify the line, then write it to the output file
```
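For reference, here is a minimal sketch of the pattern I'm using; the real per-line modification isn't shown here, so `upper()` is just a stand-in:

```python
def process_file(src, dst):
    """Stream src line by line and write a modified copy to dst.

    Iterating over the file object is lazy: only one line (plus the
    io module's internal read buffer) is held in memory at a time.
    upper() is a placeholder for the real modification.
    """
    with open(src, 'r') as in_file, open(dst, 'w') as out_file:
        for line in in_file:
            out_file.write(line.upper())
```

This keeps memory usage flat regardless of file size, unlike `readlines()`, which loads every line into a list at once.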
my disk usage stays below 10%, the disk speed stays below 5 Mbps, and the program takes ~20 minutes to finish for the 5 GB file. Wouldn't this time be lower if my disk usage were higher?
Also, does it really take ~20 minutes to read a 5 GB file, process it line by line with some modifications, and write it to a new file, or am I doing something wrong?
What I can't figure out is why the program doesn't use my system to its full potential when performing the I/O operations. If it did, my disk usage should be higher, right?