Background
In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:
abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789
Lookalike characters are excluded, so 1
, I
, l
, O
, and 0
are not present in this alphabet. This was done for user convenience and to make it easier to read and to type out a value.
As mentioned above, my previous project, python-ipminify
converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.
I wrote my own Python for this project using the excellent question and answer here on how to build a URL-shortener.
I have published a stand-alone example of the logic here as a Gist.
Problem
I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well.. I'm also having to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer, as is the case in Python.
I suppose that as long as I use an unsigned integer and use consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would try to use vector instructions where possible, but I'm not sure how this would work when the input length does not match a specific instruction length, i.e. when the input size in bits is not evenly divisible by supported instructions, e.g. 8, 16, 32, 64, 128, 256, 512 bits.
I have also considered breaking up the byte array into 256-bit (32 byte) blocks and using SIMD instructions (I only need to support x86_64 on recent CPUs) directly to operate on larger unsigned integers, but I'm not exactly sure how to deal with size % 32 != 0
blocks; I'd probably need to zero-pad, but I'm not clear on how I would know when to do this during decoding, i.e. when I don't know the underlying length of the source value, only that of the decoded value.
Question
If I'm going the arbitrary unsigned integer width route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries would be fairly optimized to vectorize as much as possible.
If I try to go the block route, I'd probably zero-pad any remaining bits in the block if the input length was not divisible by the block size during encoding. However, would it even be possible to decode such a value without knowing the decoded value size?