In my Python application, I want to find a way of space-efficiently encoding a sequence of unsigned integers in a file, small integers being very common and larger integers being relatively uncommon. I understand that what I need is some form of variable-length quantity encoding (VLQ).
I have a (possibly flawed and/or possibly CPython-specific) memory of reading somewhere that Python uses a VLQ strategy internally to represent an `int`. Are these encoding/decoding routines accessible from Python? And/or is there another way of performing VLQ fast in Python?
Possibilities I have investigated:
- I do not see anything relevant in the `struct` module documentation;
- I do not get any meaningful hits from web searches for `python VLQ` or `python "group varint"`;
- I could roll my own implementation in pure Python, but feel this would be absurdly slow when surely there must be faster ready-baked solutions somewhere out there, or already lurking somewhere in Python or its standard library;
- I tried abusing the built-in UTF-8 encoder and decoder (which could at least get me integers up to 1114111, which would be a good start for my application) by saying `encode = lambda n: chr(n).encode('utf-8')` and `decode = lambda x: ord(x.decode('utf-8'))`, but unfortunately `encode(n)` raises a `UnicodeEncodeError` with the message "surrogates not allowed" when `55296 <= n <= 57343` (see the session sketch after this list).
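Here is roughly what that UTF-8 abuse looks like in an interactive session (the byte lengths and the error are as I understand CPython's behaviour):

```python
>>> encode = lambda n: chr(n).encode('utf-8')
>>> decode = lambda x: ord(x.decode('utf-8'))
>>> decode(encode(100000))
100000
>>> [len(encode(n)) for n in (100, 1000, 100000)]   # 1 to 4 bytes depending on magnitude
[1, 2, 4]
>>> encode(56000)   # 56000 falls in the surrogate range 55296..57343
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udac0' in position 0: surrogates not allowed
```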
Update: @don'ttalkjustcode was correct to identify the `pickle` module as a place where the Python standard library does something like this, specifically in `pickle.encode_long()` and `pickle.decode_long()`. They are pure-Python implementations; in Python 2.7 they wrap around `binascii` (visible there as `pickle._binascii`), but in Python 3.2+ the work seems to be done by built-in methods of the `int` class, leading to:
encode = lambda n: n.to_bytes(n.bit_length()//8 + 1, 'little', signed=False)
decode = lambda x: int.from_bytes(x, 'little', signed=False)
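A quick round trip with these `encode`/`decode` lambdas behaves as I expect on CPython 3:

```python
>>> encode(5), encode(300), encode(70000)
(b'\x05', b',\x01', b'p\x11\x01')
>>> decode(encode(70000))
70000
```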
However, I guess these are incomplete (or, for small integers, inefficient) variable-length encodings, because you would need to spend another byte per value to encode its length separately.
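For instance, without a separate length the concatenation of two encoded values is indistinguishable from the encoding of some single larger value (continuing the sketch above):

```python
>>> encode(5) + encode(300)
b'\x05,\x01'
>>> encode(76805)            # a single value that happens to produce the same bytes
b'\x05,\x01'
```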
What I really need is something like the LEB128 encoding, for which pure-Python solutions exist, but now that I see `int.to_bytes()` and `int.from_bytes()` my guess is that Python does not do this natively.