
In my Python application, I want to find a way of space-efficiently encoding a sequence of unsigned integers in a file, small integers being very common and larger integers being relatively uncommon. I understand that what I need is some form of variable-length quantity encoding (VLQ).

I have a (possibly flawed and/or possibly CPython-specific) memory of reading somewhere that Python uses a VLQ strategy internally to represent an int. Are these encoding/decoding routines accessible from Python? And/or is there another way of performing VLQ fast in Python?

Possibilities I have investigated:

  • I do not see anything relevant in the struct module documentation;
  • I do not get any meaningful hits from web searches for python VLQ or python "group varint";
  • I could roll my own implementation in pure Python, but feel this would be absurdly slow when surely there must be faster ready-baked solutions somewhere out there, or already lurking somewhere in Python or its standard library;
  • I tried abusing the built-in UTF-8 encoder and decoder (which could at least get me integers up to 1114111, a good start for my application) by saying encode = lambda n: chr(n).encode('utf-8') and decode = lambda x: ord(x.decode('utf-8')); unfortunately encode(n) raises a UnicodeEncodeError with the message "surrogates not allowed" whenever 55296 <= n <= 57343 (see the sketch just below this list).

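For reference, here is a minimal sketch of that UTF-8 trick and the surrogate range where it fails (nothing beyond what the last bullet already describes):

encode = lambda n: chr(n).encode('utf-8')   # UTF-8 bytes of the code point n
decode = lambda x: ord(x.decode('utf-8'))   # back to the integer

print(decode(encode(300)))    # 300, encoded in 2 bytes
print(len(encode(1114111)))   # 4 bytes for the largest code point
encode(55296)                 # UnicodeEncodeError: surrogates not allowed
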
Update: @don'ttalkjustcode was correct to identify the pickle module as a place where the Python standard library does something like this, specifically in pickle.encode_long() and pickle.decode_long(). They are pure-Python implementations; in Python 2.7 they work via the standard binascii module (imported in pickle.py under the alias _binascii), but in Python 3.2+ the work seems to be done by built-in methods of the int class, leading to:

encode = lambda n: n.to_bytes(n.bit_length()//8 + 1, 'little', signed=False)

decode = lambda x: int.from_bytes(x, 'little', signed=False)
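
As a quick check of the pickle helpers themselves (note they are undocumented module-level functions, and unlike the lambdas above they use a signed little-endian encoding, so zero becomes the empty byte string):

import pickle

print(pickle.encode_long(0))            # b''  (zero encodes to the empty byte string)
print(pickle.encode_long(255))          # b'\xff\x00'  (extra byte keeps the sign bit clear)
print(pickle.decode_long(b'\xff\x00'))  # 255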

However, I guess these are incomplete (or, for small integers, inefficient) variable-length encodings on their own: to store a sequence, you would need to spend at least one extra byte per integer to encode each value's length separately.
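
A sketch of that length-prefixed framing, just to show the one-extra-byte-per-value overhead (my own illustration, not anything pickle does; the helper names encode_seq/decode_seq are made up):

def encode_seq(nums):
    out = bytearray()
    for n in nums:
        body = n.to_bytes(max(1, (n.bit_length() + 7) // 8), 'little')
        out.append(len(body))               # 1-byte length prefix (bodies up to 255 bytes)
        out += body
    return bytes(out)

def decode_seq(data):
    nums, i = [], 0
    while i < len(data):
        length = data[i]
        nums.append(int.from_bytes(data[i + 1:i + 1 + length], 'little'))
        i += 1 + length
    return nums

assert decode_seq(encode_seq([0, 1, 300, 2**64])) == [0, 1, 300, 2**64]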

What I really need is something like the LEB128 encoding, for which pure-Python solutions exist, but now that I see int.to_bytes() and int.from_bytes() my guess is that Python does not do this natively.
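
For completeness, unsigned LEB128 is simple enough to sketch in pure Python (this is the generic algorithm, not the leb128 package's own code; uleb128_encode/uleb128_decode are names I made up): each byte carries 7 bits of payload, and a clear high bit marks the last byte of a value, which is what makes concatenated values self-delimiting.

def uleb128_encode(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte of this value
            return bytes(out)

def uleb128_decode(data, pos=0):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos        # decoded value and index of the next value
        shift += 7

assert uleb128_encode(624485) == bytes([0xE5, 0x8E, 0x26])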

jez
  • https://stackoverflow.com/questions/538551/handling-very-large-numbers-in-python ; and it doesn't make sense to store numbers in something smaller than an int, because the object overhead would be larger than the data size if you did. – user10489 Aug 28 '21 at 23:36
  • @user10489 Could you elaborate on your assertion? I can't make sense of it and to me it looks like you might've misunderstood the question. – no comment Aug 28 '21 at 23:50
  • @user10489 it might make sense for writing those numbers to a file. But in that case you'd probably be better off using the [`zipfile` module](https://docs.python.org/3/library/zipfile.html). – Mark Ransom Aug 28 '21 at 23:51
  • @user10489 Hmm, you make it sound like Python ints aren't objects. Is that what you think? – no comment Aug 28 '21 at 23:56
  • No, I'm saying python numbers _are_ objects, and objects have overhead plus the memory needed to store the number. So if python used a data type smaller than an int, the overhead would be larger than the memory used to store the data. Plus padding and memory alignment... It makes sense for numpy arrays, but not individual number variables. – user10489 Aug 29 '21 at 01:22
  • Played a bit with it after all. Maybe that LEB128 package is good, but it seems to be only for one integer, so for a *sequence* of integers, you might still need to add something to separate them. Unless it does something like UTF-8. But the package description only talks about a single integer, and if it could do a sequence, I'd expect it to mention that as well. – no comment Aug 29 '21 at 01:42
  • @don'ttalkjustcode The leb128 package can do sequences by the nature of the algorithm. A byte with its most significant bit unset signals the end of an element, so you always know when to stop. Hence that package provides `decode_reader()`, which can be used repeatedly to decode a sequence from a file or from a `BytesIO` instance or similar. – jez Aug 29 '21 at 01:53
  • Ah, ok. I didn't see that, had to look at their source code to find that information. Anyway, updated my answer with an improvement. I'd be interested in an answer with a similar test using LEB128. – no comment Aug 29 '21 at 02:01
  • Added LEB128 to my answer, barely any better. – no comment Aug 29 '21 at 04:32

1 Answer


Yes, pickle does something like it. And it looks decent to me. For example, a million random ints with random sizes of 1 to 16 bytes get encoded to ~10.75 MB. And lzma.compress gets that down to ~10 MB. Not that bad compared to the "raw data size" of 8.5 MB (a million ints averaging 8.5 bytes each). LEB128 also takes ~10 MB, very slightly less.

import os, random, pickle, lzma, leb128

n = 10 ** 6
# A million random unsigned ints, each drawn from a random width of 1 to 16 bytes.
a = [int.from_bytes(os.urandom(random.randint(1, 16)), 'big') for _ in range(n)]
p = pickle.dumps(a)
print("pickle'd:", f'{len(p):,}', type(p))
z = lzma.compress(p)
print("+ lzma'd:", f'{len(z):,}', type(z))
# The same sequence as one concatenated unsigned-LEB128 byte string.
leb = b''.join(map(leb128.u.encode, a))
print("leb128'd:", f'{len(leb):,}', type(leb))

Output:

pickle'd: 10,751,961 <class 'bytes'>
+ lzma'd: 10,060,252 <class 'bytes'>
leb128'd: 10,016,053 <class 'bytes'>

Try it online!
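
Not part of the original test, but as a sketch of getting the sequence back (continuing from the snippet above, and assuming the leb128 package's decode_reader(), which reads one value from a file-like object and returns the value together with the number of bytes consumed):

import io

reader = io.BytesIO(leb)
# Read n values back out of the concatenated stream and check the round trip.
decoded = [leb128.u.decode_reader(reader)[0] for _ in range(n)]
assert decoded == a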

no comment