I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:
def process(infilepath, someStr):
    """Count the lines in infilepath that start with the prefix someStr."""
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith(someStr):
                answer += 1
    return answer
Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should probably have used multithreading instead; someone please tell me why).
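For reference, the parallel version was roughly of this shape; the paths and the prefix below are placeholders, not my real data:

import multiprocessing

# process(infilepath, someStr) is the function defined above

if __name__ == '__main__':
    infilepaths = ['a.log', 'b.log', 'c.log']   # placeholder paths
    someStr = 'ERROR'                           # placeholder prefix
    # one worker per core; each worker handles a different logfile
    with multiprocessing.Pool() as pool:
        counts = pool.starmap(process, [(path, someStr) for path in infilepaths])
    print(dict(zip(infilepaths, counts)))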
While doing so, it occurred to me that any form of parallel processing should be effectively useless here: my HDD has only one read head, so only one file can actually be read at a time. Worse, since lines from different files may be requested simultaneously, the read head may have to seek back and forth between them, which could make the multiprocessing approach slower than a serial one. So I decided to go back to a single process to read my logfiles.
Interestingly though, I noticed that I did get a speedup with small files (<= 40 KB), and that the expected slowdown only appeared with large files (>= 445 MB).
This leads me to believe that Python may read files in chunks whose size exceeds the single line I request at a time.
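As a rough sanity check on this guess, the io module exposes the default buffer size, and open() lets me override it; process_big_buffer is just a name I made up for this sketch:

import io

# the default chunk size Python's buffered reader uses (commonly 8192 bytes)
print(io.DEFAULT_BUFFER_SIZE)

def process_big_buffer(infilepath, someStr, bufsize=2**20):
    """Same line-counting loop, but with an explicitly larger (1 MiB) read buffer."""
    answer = 0
    with open(infilepath, buffering=bufsize) as infile:
        for line in infile:
            if line.startswith(someStr):
                answer += 1
    return answer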
Q1: So what is the file-reading mechanism under the hood?
Q2: What is the best way to optimize the reading of files from a conventional HDD?
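To make Q2 concrete, the kind of change I have in mind is reading each file in large sequential blocks instead of line by line, something like the sketch below (process_blocks and the 16 MiB block size are my own illustration, not a claim that they are optimal):

def process_blocks(infilepath, someStr, blocksize=2**24):
    """Count lines starting with someStr, reading the file in 16 MiB blocks."""
    answer = 0
    leftover = ''
    with open(infilepath) as infile:
        while True:
            block = infile.read(blocksize)
            if not block:
                break
            lines = (leftover + block).split('\n')
            leftover = lines.pop()          # last piece may be a partial line
            answer += sum(1 for line in lines if line.startswith(someStr))
    # whatever follows the last newline is still a line
    if leftover.startswith(someStr):
        answer += 1
    return answer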
Technical specs:
- python3.3
- 5400rpm conventional HDD
- Mac OSX 10.9.2 (Mavericks)