I have to parse a huge (250 MB) text file which for some reason is only a single line, causing every text editor I tried (Notepad++, Visual Studio, Matlab) to fail to load it. Therefore I read it piece by piece and parse it whenever a logical line (starting with '#') has been read completely:
f = open(filename, "rt")
line = ""
buffer = "blub"
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    if i != -1:  # end of line found
        line += buffer[:i]
        ProcessLine(line)
        line = buffer[i+1:]  # skip the '#'
    else:  # still reading current line
        line += buffer
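To make the problem described below concrete, here is a minimal sketch with made-up data (the record contents and the chunk are invented, not taken from my real file): if a single chunk happens to contain two '#' markers, the loop above only handles the first one, and the second record stays glued to the data that follows instead of being processed on its own.

chunk = "record one#record two#record thr"   # pretend this came from f.read(10000)
line = ""
i = chunk.find('#')
if i != -1:                    # only the first '#' is considered
    line += chunk[:i]
    print("processed:", line)  # -> "record one"
    line = chunk[i+1:]         # -> "record two#record thr": "record two" is never
                               #    handed to ProcessLine by itself
print("leftover:", line)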
This works reasonably well; however, it can happen that a logical line is shorter than my buffer, in which case a single chunk contains more than one '#' and I skip a line (as sketched above). So I replaced the loop with:
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:             # handle every '#' found in this chunk
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]  # drop the processed part and the '#'
        ProcessLine(line)
        i = buffer.find('#')
    line += buffer             # keep the incomplete rest for the next chunk
This does the trick. However, it is at least a hundred times slower, rendering it useless for reading files this large. I don't really see how this can happen. I do have an inner loop now, but most of the time it runs only once. I also copy the buffer (buffer = buffer[i+1:]), which I could understand if it halved the performance, but I don't see how it could make the code a hundred times slower.
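For what it's worth, the cost of the slice copy on its own can be checked in isolation with timeit (a standalone sketch; the 10,000-character buffer below is synthetic, not data from my file):

import timeit

buf = "x" * 10000  # synthetic buffer of the size I read per chunk
# time 100000 copies of "everything after the first character"
print(timeit.timeit("buf[1:]", globals={"buf": buf}, number=100000))

Of course this only measures the raw copy, not whatever else differs between the two loops.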
As a side note: my logical lines are about 27,000 bytes long. Therefore, if my buffer is 10,000 bytes, I never skip lines in the first implementation; if it is 30,000 bytes, I do. This does not seem to affect the performance, though: even when the inner loop in the second implementation runs at most once, performance is still horrible.
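In case someone wants to reproduce the setup without my data, a comparable file could be generated roughly like this (a sketch; the file name, the 'x' payload and the counts are made up to match the description above: ~27,000-byte logical lines, each starting with '#', about 250 MB in total, all on one physical line):

with open("test_single_line.txt", "wt") as f:
    record = "x" * 27000                           # dummy payload, roughly one logical line
    n_records = 250 * 1024 * 1024 // (27000 + 1)   # about 250 MB in total
    for _ in range(n_records):
        f.write("#")                               # each logical line starts with '#'
        f.write(record)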
What is going on under the hood that I am missing?