
I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:

def process(infilepath):
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith(someStr):  # someStr: some prefix of interest, defined elsewhere
                answer += 1
    return answer

Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should probably have used multi-threading; someone please tell me why).
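
A minimal sketch of what I mean by that (logfiles is a placeholder list of paths; the pool size is left at its default of one worker per core):

from multiprocessing import Pool

def process_all(logfiles):
    # hand each logfile to a worker process; each worker runs process() from above on one file
    with Pool() as pool:
        return pool.map(process, logfiles)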

While doing so, it occurred to me that any form of parallel processing should be effectively useless here, since I'm constrained by the fact that there is only one read head on my HDD, and therefore only one file may be read at a time. In fact, under this reasoning, since lines from different files may be requested simultaneously, the read head may need to move significantly from time to time, causing the multiprocessing approach to be slower than a serial approach. So I decided to go back to a single process to read my logfiles.

Interestingly though, I noticed that I did get a speedup with small files (<= 40KB), and that it was only with large files (>= 445MB) that the expected slow-down was observed.

This leads me to believe that Python may read files in chunks whose size exceeds the single line I request at a time.

Q1: So what is the file-reading mechanism under the hood?

Q2: What is the best way to optimize the reading of files from a conventional HDD?

Technical specs:

  • Python 3.3
  • 5400 RPM conventional HDD
  • Mac OS X 10.9.2 (Mavericks)
  • s/python/operating system/ – Ignacio Vazquez-Abrams Apr 07 '14 at 17:15
  • Your OS probably implements a read-ahead strategy of at least 64K bytes at the block level. – kindall Apr 07 '14 at 17:17
  • And under the hood there's buffered IO. Note: what's under the hood isn't Python-specific; a good starting point is to check what happens in C and in the OS. – Karoly Horvath Apr 07 '14 at 17:25
  • What you have described are the problems involved with the virtual paging system and with how the operating system reads file data through buffered IO. For more information, see http://en.wikipedia.org/wiki/Virtual_memory. – Rusty Weber Apr 07 '14 at 17:50
  • From the Python code that you have listed, "for line in infile" might be creating lots of overhead, depending on how the iterator is implemented. Have you tried different methods for reading the file? (I hate the 5 min edit rule... accidentally pressed enter) – Rusty Weber Apr 07 '14 at 17:57
  • This was an interesting observation while I was getting some work done, so I didn't have time to check different file-read methods. However, that /would/ normally be my first stop. I was just checking to see if anyone had some wisdom they could share before I embarked on re-discovering documented wisdom. – inspectorG4dget Apr 07 '14 at 18:03

3 Answers


The observed behavior is a result of:

  1. BufferedIO
  2. a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read

BufferedIO

Depending on the OS and the read block size, it is possible for an entire (small) file to fit into one block, which can then be fetched with a single read command. This is why the smaller files are read more quickly.
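
A rough way to see this from Python (the filename and buffer size below are placeholders; the actual defaults vary by platform):

import io

print(io.DEFAULT_BUFFER_SIZE)  # default block size for buffered reads, typically 8 KB

# the buffer can also be sized explicitly per file, e.g. 1 MB here
with open("some.log", buffering=1024 * 1024) as infile:
    for line in infile:
        pass  # each line is served from the in-memory buffer; the disk is hit once per refill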

Scheduling Algorithm

Larger files (filesize > read block size) have to be read in block-sized chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the read head has to move to different sectors (corresponding to where the files live) of the HDD. This repetitive movement does two things:

  1. increases the time between successive reads of the same file
  2. throws off the read-sector predictor, as a file may span multiple sectors

The time between successive reads of the same file matters: if the computation performed on a chunk of lines completes before the read head can provide the next chunk of lines from the same file, the process simply waits until another chunk becomes available. This is one source of slowdown.

Throwing off the read-sector predictor is bad for pretty much the same reasons that throwing off the branch predictor is bad.

With the combined effect of these two issues, processing many large files in parallel would be slower than processing them serially. Of course, this is especially true when processing blockSize-many lines finishes before numProcesses * blockSize-many lines can be read off the HDD.
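
A quick way to check this on your own data is to time both strategies over the same set of files. This is only a sketch: process() is the function from the question, and "logs/*.log" is a placeholder pattern.

import glob
import time
from multiprocessing import Pool

def run_serial(paths):
    return [process(p) for p in paths]

def run_parallel(paths):
    with Pool() as pool:
        return pool.map(process, paths)

if __name__ == "__main__":
    paths = glob.glob("logs/*.log")  # placeholder; point this at your logfiles
    for label, runner in (("serial", run_serial), ("parallel", run_parallel)):
        start = time.perf_counter()
        runner(paths)
        print(label, time.perf_counter() - start, "seconds")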


Another idea would be to profile your code:

try:
    import cProfile as profile
except ImportError:
    import profile

profile.run("process()")
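
If the flat listing is hard to read, both profile modules accept a sort argument, e.g. sorting by cumulative time (the path is again a placeholder):

profile.run('process("one_of_your_logfiles.log")', sort="cumulative")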

Here is an example of using a memory-mapped file:

import mmap

with open("hello.txt", "r+b") as f:
    # length 0 maps the whole file
    mapf = mmap.mmap(f.fileno(), 0)
    print(mapf.readline())  # reads the first line straight out of the mapping
    mapf.close()
  • How would memory-mapped files help here? In the original post, Python only reads the file mostly line by line, without ever using much memory (unless lines are huge, which should not be the case, since the file is open in text mode). – Eric O. Lebigot Oct 25 '14 at 03:48