
Suppose I were limited to using only 32-bit unsigned integers to express strings. Obviously, I could use individual u8 values and allocate enough of them to describe a short string, but say compute and time aren't important; this is for my curiosity, not necessarily for real-world use.

I observe that a 32-bit number is the same size as 4 strict u8 chars. In decimal, there's space to encode 4 characters of any encoding whose characters can be indexed by a 2-digit decimal number, while 5 six-bit ECMA-1 characters could fit in the same bit size.
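(As a quick capacity check: four two-digit codes need 100^4 = 10^8 distinct values, and five 6-bit ECMA-1 characters need (2^6)^5 = 2^30; both fit under 2^32 ≈ 4.29 × 10^9.)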

Suppose I want the range of printable characters, using a mapped ASCII table, where I subtract 32 to get the printable characters into 2 decimal digits (32 to 126 become 0 to 94). Suppose a mapping function like |c, i| (c - 32) * 10^((3 - i) * 2), where c is the ASCII value and i is the position, summed over the characters: the string “Bat!”, as a u8 array [66, 97, 116, 33], encodes to 34658401.
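For concreteness, here is a minimal Rust sketch of that packing; the function names are mine, and it assumes the input is restricted to at most 4 printable ASCII characters:

```rust
/// Pack up to 4 printable ASCII chars (32..=126) into a u32:
/// two decimal digits per char, first char in the most significant pair.
fn pack(s: &str) -> u32 {
    s.bytes()
        .map(|c| (c - 32) as u32)        // 32..=126 -> 0..=94
        .fold(0, |acc, d| acc * 100 + d) // append a base-100 "digit"
}

/// Reverse the packing; `len` is the original string length.
fn unpack(mut n: u32, len: usize) -> String {
    let mut bytes = vec![0u8; len];
    for slot in bytes.iter_mut().rev() {
        *slot = (n % 100) as u8 + 32; // undo the -32 offset
        n /= 100;
    }
    String::from_utf8(bytes).unwrap() // always valid ASCII here
}

fn main() {
    let n = pack("Bat!");
    assert_eq!(n, 34_658_401);
    assert_eq!(unpack(n, 4), "Bat!");
}
```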

Clearly this is not computationally efficient, but I'm not necessarily shooting for efficiency. Just pure curiosity here.

Supposing compute is arbitrary, so that even totally absurd approaches are fair game, how might I encode a longer string in a 32-bit unsigned integer?

Sam Hughes

1 Answer


First you need to decide which characters you want to encode. Suppose you have chosen k characters and mapped them to the numbers 0 to k-1. Then every integer n is mapped to a unique non-empty string by expressing n in base k and mapping each base-k digit to the corresponding character. You could reserve the maximum integer for the empty string.

So you just need a mapping table for the k characters and a function to convert an integer from one base to another. That's simple and efficient, and the encoding is also optimally dense, since every integer maps to a unique string.
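As a hedged sketch of that scheme in Rust (the alphabet and function names here are illustrative, not from the question):

```rust
/// Decode an integer to a string by expanding n as a base-k numeral
/// over the given alphabet. u32::MAX is reserved for the empty string.
fn int_to_string(mut n: u32, alphabet: &[char]) -> String {
    if n == u32::MAX {
        return String::new(); // reserved sentinel for ""
    }
    let k = alphabet.len() as u32;
    let mut chars = Vec::new();
    loop {
        chars.push(alphabet[(n % k) as usize]); // low-order digit first
        n /= k;
        if n == 0 {
            break;
        }
    }
    chars.iter().rev().collect() // most significant digit first
}

fn main() {
    // Printable ASCII, 32..=126, so k = 95; any chosen alphabet works.
    let alphabet: Vec<char> = (' '..='~').collect();
    println!("{}", int_to_string(34_658_401, &alphabet));
    println!("{:?}", int_to_string(u32::MAX, &alphabet)); // ""
}
```

One caveat: with plain positional base k, strings whose first character maps to digit 0 are never produced by this decoding, so if you need every string to be reachable in the other direction, bijective base-k numeration (digits 1 to k, no zero) is the usual fix.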

Mo B.
  • I was thinking about it again this afternoon, and I realized that I'm implicitly doing that exact thing in a most inefficient way. Essentially, I've observed that one can encode base-100 mappings of symbols onto an int. My method was to first convert from base-2 to base-10, and then use a two-digit base-10 representation -- effectively base-100, the long way round. Looks like base-73 is the most densely expressive I can get in a 32-bit space, with 5 characters encoded in a reasonable symbol set. I suppose with state machine semantics, I could be more expressive... but, eww. – Sam Hughes Jan 13 '21 at 07:54