
I have to parse a huge (250 MB) text file which, for some reason, consists of only a single line, causing every text editor I tried (Notepad++, Visual Studio, Matlab) to fail to load it. Therefore I read it piece by piece and parse it whenever a logical line (starting with #) has been read completely:

f = open(filename, "rt")

line = ""
buffer = "blub"
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    if i != -1: # end of line found
        line += buffer[:i]
        ProcessLine(line)
        line = buffer[i+1:] # skip the '#'
    else: # still reading current line
        line += buffer

This works reasonably well; however, it might happen that a line is shorter than my buffer, which would cause me to skip a line. So I replaced the loop with

while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        i = buffer.find('#')
    line += buffer

which does the trick. However, this is at least a hundred times slower, rendering it useless for reading files that large. I don't really see how this can happen. I do have an inner loop, but most of the time it is only executed once. I also copy the buffer (buffer = buffer[i+1:]), which I could somewhat understand if performance dropped by half, but I don't see how this could make it about 100 times slower.

As a side note: my (logical) lines are about 27,000 bytes long. Therefore, if my buffer is 10,000 bytes, I never skip lines in the first implementation; if it is 30,000 bytes, I do. This does not seem to affect performance, though: even when the inner loop in the second implementation is executed at most once, performance is still horrible.

What is going on under the hood that I am missing?

JonathanK
  • In the first version you do the splitting only once per 10,000 bytes; in the second one you do it as many times as there are #s in the chunk, so it should be X times slower, where X is the average number of #s per 10,000 bytes. If it is X times slower, then there is your answer. – Andrey Jun 28 '16 at 09:40
  • No, that can't be it. In my test case, I have a # every ~27,000 characters. If my buffer length is 30,000, I skip some of these; if it is 10,000, I get them all. In both cases, the performance of the second implementation is horrible. I added this as a note. – JonathanK Jun 28 '16 at 10:05
  • 250 MB is a small amount of data. Why don't you just read it all into memory and process it there? It'll be a lot faster, even with OS file read buffers. – advance512 Jun 28 '16 at 10:07
  • Also, instead of creating the `line` variable using concatenation etc., just find the index of the first # and of the next #, and use a slice of the buffer - you'll also see a big improvement using this approach. – advance512 Jun 28 '16 at 10:09
  • Have you [profiled](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) it? It's hard to talk about performance when we don't know where the bottleneck is. – Łukasz Rogalski Jun 28 '16 at 10:13
  • Use list append and join instead of string concatenation in Python (see the sketch below). – YOU Jun 28 '16 at 10:42
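
A minimal sketch of the append-and-join approach suggested in the last comment, assuming the filename, the ProcessLine function and the 10,000-byte chunk size from the question; instead of growing a string, it collects pieces in a list and joins them once per logical line:

parts = []                             # pieces of the current logical line
with open(filename, "rt") as f:
    while True:
        buffer = f.read(10000)
        if buffer == "":
            break
        start = 0
        while True:
            i = buffer.find('#', start)
            if i == -1:                # no delimiter left in this chunk
                parts.append(buffer[start:])
                break
            parts.append(buffer[start:i])
            ProcessLine("".join(parts))    # one join per logical line
            parts = []
            start = i + 1              # skip the '#'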

2 Answers


If I understood correctly what you want to do, then both versions of your code are wrong. As @Leon said, in the second version you are missing line = "" after ProcessLine(line). In the first version, only the very first line is processed correctly: as you said yourself, if a line is shorter than the buffer, you take just the first part of the buffer with line += buffer[:i], but the problem is the line line = buffer[i+1:]. If your line is, say, 1,000 characters long and the buffer is 10,000 characters long, then line = buffer[i+1:] leaves about 9,000 characters in line, probably containing more than one logical line. From reading:

"This works reasonably well, however, it might happen, that a line is shorter than my buffer, which would cause me to skip a line"

I think you realised that, but the reason I am writing this out in detail is that it is also the reason why your first version runs faster.

Having explained that, I think the best approach would be to read the whole file and then split the text to get the lines, so your code would look like this:

with open('textfile.txt', "rt") as f:
    buffer = f.read()
l = buffer.split('#')

and then you can use something like:

for line in l:
    ProcessLine(line)

Building the list l took me less than 2 seconds.

PS: You shouldn't have problems opening large files (like 250 MB) with Notepad; I have even opened 500 MB files.
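
If reading the whole file at once were not an option, a lazy alternative is also possible. This is only a sketch, assuming the same '#' delimiter, chunk size and ProcessLine function as in the question; it yields logical lines one at a time from fixed-size chunks:

def logical_lines(path, chunk_size=10000):
    parts = []                          # pieces of the current logical line
    with open(path, "rt") as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk == "":
                break
            pieces = chunk.split('#')
            parts.append(pieces[0])     # still belongs to the current line
            for piece in pieces[1:]:
                yield "".join(parts)
                parts = [piece]
    tail = "".join(parts)
    if tail:
        yield tail                      # text after the last '#'

for line in logical_lines('textfile.txt'):
    ProcessLine(line)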

ands
  • I think the problem is that everything is on just one line. If I insert a newline after each `#` and save it, I can read it with Notepad, no problem. Likewise, just doing `for l in open(filename)` does not work either. However, your solution with `read()` and `split()` works nicely and is simpler than what I did so far, so I will go with that. – JonathanK Jun 28 '16 at 14:58

Your second version not only runs slower, it also works incorrectly.

In your first version you reset line with an assignment (line = buffer[i+1:]), whereas in the second version you only ever append to line. As a result, in the second version line ends up containing the entire contents of your file, minus the # symbols.

Fix the code by clearing line immediately after processing it:

while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        line = ""               # sic!
        i = buffer.find('#')
    line += buffer
Leon
  • You are right. And this essentially makes the algorithm quadratic in runtime, as the beginning of `line` is parsed over and over again. Thus, it perfectly explains the drastic drop in performance. – JonathanK Jun 28 '16 at 15:00
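
To illustrate this, here is a minimal timing sketch (the input data and the dummy ProcessLine are made up; only the loop structure comes from the code above). Without the line = "" reset, every call to ProcessLine rescans everything read so far, so the total work grows quadratically:

import time

def ProcessLine(line):
    # stand-in for the real parser: cost proportional to len(line)
    return line.count("x")

data = "x" * 26999 + "#"          # one ~27,000-character logical line
chunks = [data * 10] * 20         # 200 logical lines, delivered in 10-line chunks

def consume(reset_line):
    line = ""
    for buffer in chunks:
        i = buffer.find('#')
        while i != -1:
            line += buffer[:i]
            buffer = buffer[i+1:]
            ProcessLine(line)
            if reset_line:
                line = ""         # the fix from the answer above
            i = buffer.find('#')
        line += buffer

for reset_line in (True, False):
    start = time.time()
    consume(reset_line)
    print("reset:", reset_line, "took", round(time.time() - start, 3), "s")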