Boy, am I not understanding the Python pass-by-reference issues... I have created the extremely useful "unpacker" class, which I pass around to various objects that need to unpack from it. Yet given how extraordinarily slow it is, I can tell it's making a copy of binaryStr each time I pass a BU (BinaryUnpacker) object. I know this because if I break the BU into smaller chunks, it runs, literally, 100x faster (I was originally using it to hold a 16MB file I/O buffer).

So my question is, why is that member not getting passed by reference, and is there a way to force it to? I am pretty sure the BU object itself is passed by reference (since my code works), but the speed suggests the .binaryStr object is copied. Is there something more subtle that I'm missing?

class BinaryUnpacker(object):
    def __init__(self, binaryStr):
        self.binaryStr = binaryStr
        self.pos = 0

    def get(self, varType, sz=0):
        pos = self.pos
        if varType == UINT32:
            value = unpack('<I', self.binaryStr[pos:pos+4])[0]
            self.pos += 4
            return value
        elif varType == UINT64:
            value = unpack('<Q', self.binaryStr[pos:pos+8])[0]
            self.pos += 8
            return value
        elif varType == VAR_INT:
            value, nBytes = unpackVarInt(self.binaryStr[pos:])
            self.pos += nBytes
            return value
        # ... (remaining types elided)

The use case for this is something along the lines of :

def unserialize(self, toUnpack):
    if isinstance(toUnpack, BinaryUnpacker):
        buData = toUnpack
    else:  # assume string
        buData = BinaryUnpacker(toUnpack)

    self.var1    = buData.get(VAR_INT)
    self.var2    = buData.get(BINARY_CHUNK, 64)
    self.var3    = buData.get(UINT64)
    self.var4obj = AnotherClass().unserialize(buData)

Thanks so much for your help.

etotheipi
  • Every name in Python is a reference. – Cat Plus Plus Jul 21 '11 at 13:14
  • I'm not sure what language background you're coming from, but generally it's frowned upon to write getters and setters in Python. Additionally, one of Python's greatest strengths is so-called duck typing. That is, you don't need to insert your own type checking, since if some other object implements the same methods, you can use that object as a drop-in replacement. Anyway, I can't answer your question exactly, but I think you might benefit from reading "Python is not Java" http://dirtsimple.org/2004/12/python-is-not-java.html if you come from a statically typed language background. – Wilduck Jul 21 '11 at 13:15
  • Python always passes by reference. You are just making a copy when you use slices like `self.binaryStr[pos:]` – agf Jul 21 '11 at 13:18
  • @Wilduck "generally it's frowned upon to write getters and setters in Python"? No, you just do it a different way than you do in other languages. Properties still involve writing getters and setters. – agf Jul 21 '11 at 13:20
  • @agf: Well, it's still the case that you don't write them unless it turns out you actually need them, at which point you can use a descriptor (like `property`). You don't need to (and shouldn't) do this *just in case*, as with Java. – SingleNegationElimination Jul 21 '11 at 13:25
  • These are not getters and setters... the "get" method reads the next few bytes from the binary string and increments the position counter. It enables a very straightforward way to read incremental bytestrings from a large binary sequence. – etotheipi Jul 21 '11 at 15:27

3 Answers

The copies are made when you slice a string to get a substring. For example:

[value, nBytes] = unpackVarInt(self.binaryStr[pos:])

This will create a copy of the string from index pos to the end, which can take time for a long string. It will be faster if you can determine the number of bytes you actually need before taking the substring, and then use self.binaryStr[pos:pos+nBytes], since taking a small substring is relatively fast.

Note that the time depends only on the length of the substring, so self.binaryStr[pos:pos+4] should take roughly the same amount of time regardless of the length of self.binaryStr.
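One way to avoid the tail slice entirely is to read at an offset with struct.unpack_from, so nothing larger than the decoded field is ever created. The following is only a sketch, assuming Python 3 bytes. OffsetUnpacker and read_var_int are hypothetical names, and the var-int layout (a prefix byte of 0xfd, 0xfe or 0xff announcing a 2-, 4- or 8-byte little-endian integer, at most 9 bytes in total) is an assumption rather than the asker's actual unpackVarInt:

```python
import struct

def read_var_int(buf, pos):
    """Decode a variable-length integer at buf[pos]; return (value, n_bytes).

    Assumed (hypothetical) encoding: a first byte below 0xfd is the value
    itself; 0xfd, 0xfe and 0xff announce a little-endian 2-, 4- or 8-byte
    integer that follows the prefix byte.
    """
    first = buf[pos]  # indexing bytes yields an int in Python 3
    if first < 0xfd:
        return first, 1
    fmt, size = {0xfd: ('<H', 2), 0xfe: ('<I', 4), 0xff: ('<Q', 8)}[first]
    # unpack_from reads at an offset, so no substring is built
    return struct.unpack_from(fmt, buf, pos + 1)[0], 1 + size

class OffsetUnpacker(object):
    """Hypothetical variant of BinaryUnpacker that never slices the buffer."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def get_uint32(self):
        value = struct.unpack_from('<I', self.data, self.pos)[0]
        self.pos += 4
        return value

    def get_var_int(self):
        value, n_bytes = read_var_int(self.data, self.pos)
        self.pos += n_bytes
        return value

# One 1-byte var int, one 3-byte var int, then a UINT32
payload = b'\x05' + b'\xfd\x34\x12' + struct.pack('<I', 7)
bu = OffsetUnpacker(payload)
print(bu.get_var_int(), bu.get_var_int(), bu.get_uint32())
```

With this approach the cost of each read depends only on the size of the field, not on how much of the buffer remains after pos.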

interjay
  • Ack, that might be my problem. I should slice out the maximum size of a VAR_INT (9 bytes), instead of the whole string. I stupidly use [pos:] because it's variable-length, but there is a hard upper-bound on length... – etotheipi Jul 21 '11 at 13:53

I did not look at your code in depth, but types that support the buffer protocol (such as strings) can be accessed through memoryview objects without having to copy the data. Here's the relevant documentation for it.

You could use a memoryview object instead of slicing the string: this way you would avoid the time-consuming copy in your current code.
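As a rough illustration, assuming Python 3 and a bytes buffer (the names data, view and tail are made up for this sketch), slicing a memoryview produces another view onto the same memory rather than a new string:

```python
import struct

data = bytes(range(256)) * 4      # stand-in for a large I/O buffer
view = memoryview(data)

pos = 8
tail = view[pos:]                 # a new view object; the bytes are not copied

# struct.unpack accepts any bytes-like object, including a memoryview
value = struct.unpack('<I', tail[:4])[0]

print(tail.obj is data)           # the view still refers to the original buffer
print(hex(value))
```

The views themselves are small objects, so taking view[pos:] near the start of a 16MB buffer costs about the same as taking it near the end.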

A few days ago I asked a question about this that perhaps could be useful to you.

mac

I don't think you can judge this by speed alone. You say you can tell that the string is being copied because the code runs much faster when you break it into smaller chunks. But the running time of the unpack() function, which you didn't give details about, could also depend on the data size.

Besides, slicing a string such as

unpack('<I', self.binaryStr[pos:pos+4])[0]

will create new string objects since strings are immutable objects.
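Instead of inferring copies from timing, object identity can be checked directly. This is a small sketch assuming Python 3 bytes; Holder is a hypothetical stand-in for BinaryUnpacker, and receives is a made-up helper:

```python
class Holder(object):
    """Hypothetical stand-in for BinaryUnpacker: it only stores a reference."""

    def __init__(self, payload):
        self.payload = payload

def receives(holder, original):
    # True if the attribute seen inside the function is the very same
    # object the caller created, i.e. nothing was copied along the way.
    return holder.payload is original

buf = bytes(1000000)
holder = Holder(buf)

print(receives(holder, buf))      # passing the holder does not copy the buffer
tail = buf[1:]
print(tail is buf)                # a non-trivial slice is a separate object
print(len(tail))
```

The first check confirms that passing the wrapper around shares the buffer, while the slice produces a distinct object holding its own copy of the remaining bytes.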

Derrick Zhang
  • That's struct.unpack, which is not my own function. It's one of my favorite reasons for using python, since pack/unpack make binary I/O absolutely trivial... – etotheipi Jul 21 '11 at 13:55