
Background

In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:

abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789

Lookalike characters are excluded, so 1, I, l, O, and 0 are not present in this alphabet. This was done for user convenience: it makes a value easier to read and to type out.

As mentioned above, my previous project, python-ipminify, converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.

I wrote my own Python implementation for this project using the excellent question and answer here on how to build a URL-shortener.

I have published a stand-alone example of the logic here as a Gist.
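The logic boils down to a repeated divmod by the alphabet length. Here is a minimal sketch of that idea in Rust (not the Gist itself, which is Python; this assumes a fixed-width u128 input just for brevity):

```rust
// Minimal sketch of the divmod loop, assuming a fixed-width u128 input;
// the Python version does the same thing on arbitrary-width integers.
const ALPHABET: &[u8] = b"abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789";

fn encode(mut n: u128) -> String {
    let base = ALPHABET.len() as u128; // 56 symbols
    if n == 0 {
        return (ALPHABET[0] as char).to_string();
    }
    let mut out = Vec::new();
    while n > 0 {
        out.push(ALPHABET[(n % base) as usize]); // least-significant digit first
        n /= base;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn decode(s: &str) -> Option<u128> {
    let base = ALPHABET.len() as u128;
    let mut n: u128 = 0;
    for c in s.bytes() {
        let digit = ALPHABET.iter().position(|&a| a == c)? as u128;
        n = n * base + digit;
    }
    Some(n)
}
```

For any n, decode(&encode(n)) round-trips back to Some(n).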

Problem

I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well. I also have to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer as in Python.

I suppose that as long as I use an unsigned integer and consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would try to use vector instructions where possible, but I'm not sure how that works when the input length doesn't match a supported operand width, i.e. when the input size in bits is not evenly divisible by a width the instructions operate on, e.g. 8, 16, 32, 64, 128, 256, or 512 bits.
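For illustration, this is roughly what I mean by that route, sketched with the num-bigint crate (the crate choice and the big-endian interpretation are just assumptions on my part, not requirements; it also pulls in num-integer and num-traits):

```rust
// Rough sketch of the arbitrary-precision route: read the byte array as one
// big-endian unsigned integer, then repeatedly divmod by the alphabet size.
use num_bigint::BigUint;
use num_integer::Integer;
use num_traits::{ToPrimitive, Zero};

const ALPHABET: &[u8] = b"abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789";

fn encode_bytes(bytes: &[u8]) -> String {
    let base = BigUint::from(ALPHABET.len());
    let mut n = BigUint::from_bytes_be(bytes); // leading zero bytes are lost here
    let mut out = Vec::new();
    while !n.is_zero() {
        let (q, r) = n.div_rem(&base);
        out.push(ALPHABET[r.to_usize().unwrap()]);
        n = q;
    }
    if out.is_empty() {
        out.push(ALPHABET[0]); // input was empty or all zero bytes
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}
```

Note that leading zero bytes in the input disappear as soon as the bytes are read as an integer, which is essentially the same length-recovery problem I describe below for the block approach.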

I have also considered breaking up the byte array into 256-bit (32-byte) blocks and using SIMD instructions directly (I only need to support x86_64 on recent CPUs) to operate on larger unsigned integers, but I'm not exactly sure how to deal with inputs where size % 32 != 0; I'd probably need to zero-pad the final block, but I'm not clear on how I would know to undo this during decoding, i.e. when I don't know the underlying length of the source value, only that of the encoded value I'm given.
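To make the padding part concrete, here's a sketch of just the block-splitting step; the per-block encoding, and especially how the original length would be recovered on decode, is exactly the part I'm unsure about:

```rust
// Sketch of only the block-splitting/padding step: cut the input into 32-byte
// (256-bit) chunks and zero-pad the final chunk. This assumes the true byte
// length would have to be carried out of band somehow, which is the open question.
fn split_into_blocks(input: &[u8]) -> Vec<[u8; 32]> {
    input
        .chunks(32)
        .map(|chunk| {
            let mut block = [0u8; 32];
            block[..chunk.len()].copy_from_slice(chunk); // zero-pad the tail
            block
        })
        .collect()
}
```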

Question

If I go the arbitrary-width unsigned integer route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries are fairly well optimized and vectorize as much as possible.

If I go the block route, I'd probably zero-pad any remaining bits in the final block during encoding when the input length is not divisible by the block size. However, would it even be possible to decode such a value without knowing the decoded value's size?

  • How recent? If you can exclude Zen 1 / Zen 2, Intel since Haswell and AMD since Zen3 have efficient BMI2 pdep / pext that could help repack bits from 5 or 6 bits per byte down to contiguous. Unfortunately those instructions are slow on older AMD. With `vpshufb` for lookup tables to map between integers and ASCII codes (for the low 4 bytes) that might get you somewhere... (I'm assuming you choose your base, i.e. alphabet size, to be a power of 2 so it's just a matter of re-grouping bits, not multiplying and dividing.) – Peter Cordes Sep 22 '21 at 21:37
  • 1
    Division by non-power-of-2 is expensive; arbitrary-precision division by non-power-of-2 is *very* expensive. If it makes any sense at all to set your alphabet size to a power of 2, by all means do it. – Nate Eldredge Sep 23 '21 at 14:17
