
I have a buffer and I need to make sure it doesn't exceed a certain size. If it does, I want to append the buffer's contents to a file and empty the buffer.

My code:

import sys

MAX_BUFFER_SIZE = 4 * (1024 ** 3)

class MyBuffer(object):
    b = ""

    def append(self, s):
        if sys.getsizeof(self.b) > MAX_BUFFER_SIZE:
            # ...write self.b out to the file... then empty the buffer
            self.b = ""
        self.b += s


buffer = MyBuffer()
for s in some_text:
    buffer.append(s)

However, this comparison (`sys.getsizeof(self.b) > MAX_BUFFER_SIZE`) is way too slow (i.e. without the comparison the whole execution takes less than 1 second; with it, it takes around 5 minutes).

At the moment I can fit the whole of some_text in memory, so the buffer never actually gets bigger than MAX_BUFFER_SIZE, but I must make sure my code works for huge files (several TB in size) too.

Edit:

This code runs in under 1 second:

import sys

buffer = ""
for s in some_text:
    buffer += s

#print out to file

The problem is that the buffer might become too big.

Similarly, this code also runs in under 1 second:

import sys

MAX_BUFFER_SIZE = 4 * (1024 ** 3)

class MyBuffer(object):
    b = ""

    def append(self, s):
        print sys.getsizeof(self.b)


buffer = MyBuffer()
for s in some_text:
    buffer.append(s)

EDIT 2:

Sorry, the slow part is actually appending to the buffer, not the comparison itself as I thought. When I was testing the code, I had commented out the whole if/else statement instead of just the comparison.

Hence, is there an efficient way to keep a buffer?
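
For reference, one common way to avoid the quadratic cost of repeatedly concatenating to a string is to collect the pieces in a list and join them only when flushing. The sketch below is only an illustration (the ChunkBuffer name, the 100 MB threshold and the 'myfile.txt' path are placeholders), and it tracks the total length with len() instead of sys.getsizeof():

MAX_BUFFER_SIZE = 100 * (1024 ** 2)  # flush threshold in characters (illustrative)

class ChunkBuffer(object):
    def __init__(self, path):
        self.path = path    # output file, e.g. 'myfile.txt' (placeholder)
        self.chunks = []    # pieces collected so far
        self.length = 0     # running total of their lengths

    def append(self, s):
        self.chunks.append(s)
        self.length += len(s)
        if self.length > MAX_BUFFER_SIZE:
            self.flush()

    def flush(self):
        with open(self.path, 'a') as f:
            f.write(''.join(self.chunks))  # one big write instead of many string copies
        self.chunks = []
        self.length = 0

buf = ChunkBuffer('myfile.txt')
for s in some_text:
    buf.append(s)
buf.flush()  # write out whatever is left at the end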

emihir0
  • I suggest using `len(self.b)` instead of `sys.getsizeof(self.b)`. `self.b` is a simple string, so retrieving its size is trivial and fast. Note, however, that constantly appending to a string is slow, since it often needs to reallocate the memory for the string, and reallocating 4 GB is going to take a while. – Sven Marnach Oct 07 '16 at 10:35
  • When I change buffer.append(s) to just `my_global_buffer += s` and then print out to the file at the very end, the execution is still less than 1 second, so I don't think appending to the buffer itself is the slow part. – emihir0 Oct 07 '16 at 10:36
  • @emihir0 CPython can optimise appending to the buffer in certain cases, and it's a bit tricky to figure out whether it's possible in your case. It depends on the reference count of the string you are appending to, and maybe on some other factors. – Sven Marnach Oct 07 '16 at 10:40
  • @SvenMarnach Basically my initial solution was to append to the file in every iteration of the `for s in some_text:`. Obviously, that is slow and so I thought it would be more efficient to make a buffer and then append at the end. Now I need to make sure the buffer doesn't exceed some size. Is there some already existing solution for this use case (that is efficient of course)? – emihir0 Oct 07 '16 at 10:41
  • @emihir0 Yes, use the [`buffering` parameter to the `open()` function](https://docs.python.org/3/library/functions.html#open). – Sven Marnach Oct 07 '16 at 10:44
  • @SvenMarnach How would that work, though? Wouldn't I still have to keep a `buffer` that I append to and then call `open('myfile.txt', 'a+', MAX_BUFFER_SIZE)`? I've read the documentation but am a bit confused about the use case. – emihir0 Oct 07 '16 at 10:47
  • @emihir0 You simply write to the file in each iteration. Python takes care of buffering for you. Here's more information on string concatenation: http://stackoverflow.com/questions/4435169/good-way-to-append-to-a-string – Sven Marnach Oct 07 '16 at 10:49
  • If I use `with open('myfile.txt', 'a+', MAX_BUFFER_SIZE) as f: f.write(s)` within the loop, it still takes way longer than it should. I changed `MAX_BUFFER_SIZE` to `100 * (1024 ** 2)` though (100 MB), as it was giving me errors otherwise. – emihir0 Oct 07 '16 at 10:56
  • @emihir0 Are you really opening and closing the file in each loop iteration? Of course that's slow! Just open it once, and close it when you are done. You probably don't even need to bother with the `buffering` parameter. The default should be fine (and 100 MB is way too big). – Sven Marnach Oct 07 '16 at 11:03
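
A minimal sketch of what the comments above suggest: open the output file once before the loop and write each piece directly, letting the file object's own buffering batch the actual disk writes ('myfile.txt' and some_text are placeholders taken from the question):

# Slow: reopening the file on every iteration pays the open/close cost each time.
#
#     for s in some_text:
#         with open('myfile.txt', 'a+') as f:
#             f.write(s)

# Faster: open once, write each piece, and let the default buffering batch the writes.
with open('myfile.txt', 'a+') as f:
    for s in some_text:
        f.write(s)  # goes into the file object's internal buffer, flushed on close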

1 Answer


Undeleting and editing my answer based on edits to the question.

It's incorrect to assume that the comparison is slow. In fact, the comparison is fast. Really, really fast.

Why not avoid reinventing the wheel by using buffered I/O?

The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used. [2]

https://docs.python.org/2/library/functions.html#open
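
For illustration, a minimal sketch of that approach (the file name and the 1 MB buffer size are placeholder values; the system default is usually sufficient):

BUF_SIZE = 1024 * 1024  # ~1 MB; omit the argument entirely to get the system default

with open('myfile.txt', 'a', BUF_SIZE) as f:
    for s in some_text:
        f.write(s)  # buffered by the file object and flushed when the file is closed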

e4c5
  • I can't follow. Why should the speed of `sys.getsizeof()` depend on the size of the string? – Sven Marnach Oct 07 '16 at 10:32
  • @e4c5 when I just print sys.getsizeof(self.b) and comment out the rest of the method, it still runs in under 1 second, hence I would have thought that getting the size of buffer is not the expensive part, it's the comparison. – emihir0 Oct 07 '16 at 10:35
  • @emihir0 My suspicion is that either writing to the file or appending to the string is slow. Appending to a string in a loop can make the time complexity O(n²), since the whole string needs to be copied in each iteration. – Sven Marnach Oct 07 '16 at 10:36
  • Yes, I just did some testing, my answer is indeed incorrect! – e4c5 Oct 07 '16 at 10:37