Compress 21 Alphanumeric Characters in to 16 Bytes

Question

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algorithm for this.

The trade ID which I'm trying to compress consists of 2 fields:

18 alphanumeric characters consisting of the ASCII characters 0x20 to 0x7E, Inclusive. (32-126)
A 3-character numeric string "000" to "999"

So a C++ class that would encompass this data looks like this:

class ID
{
public:
    char trade_num_[18];
    char broker_[3];
};

This data needs to be stored in a 16-char data structure, which looks like this:

class Compressed
{
public:
    char sku_[16];    
};

I tried to take advantage of the fact that since the characters in trade_num_ are only 0-127 there was 1 unused bit in each character. Similarly, 999 in binary is 1111100111, which is only 10 bits -- 6 bits short of a 2-byte word. But when I work out how much I can squeeze this down, the smallest I can make it is 17 bytes; one byte too big.

Any ideas?

By the way, trade_num_ is a misnomer. It can contain letters and other characters. That's what the spec says.

EDIT: Sorry for the confusion. The trade_num_ field is indeed 18 bytes and not 16. After I posted this thread my internet connection died and I could not get back to this thread until just now.

EDIT2: I think it is safe to make an assumption about the dataset. For the trade_num_ field, we can assume that the non-printable ASCII characters 0-31 will not be present. Nor will ASCII codes 127 or 126 (~). All the others might be present, including upper and lower case letters, numbers and punctuations. This leaves a total of 94 characters in the set that trade_num_ will be comprised of, ASCII codes 32 through 125, inclusive.

does the compression have to be two way (ie is a one-way hash acceptable)? If so, could you use a lookup table to map? — Alan, Aug 05 '10 at 22:13
Are the characters alphanumeric (letters and digits only), or can they be any ASCII character? — John Kugelman, Aug 05 '10 at 22:15
Also why is trade_num[18] when it only needs to store 16 bytes? — Alan, Aug 05 '10 at 22:16
@John: The 16 characters in `trade_num_` above can be any ASCII character in the range 0x00 to 0x7F. (eg, 0-127, or just the first 127 characters in the ASCII range) — John Dibling, Aug 05 '10 at 22:17
@Alan: the class `ID` represents incoming data. I need to compress this to a max of 16 bytes. — John Dibling, Aug 05 '10 at 22:18
@John: Okay, so the spec has an 18-byte field, but only uses it for 16 bytes? — Alan, Aug 05 '10 at 22:22
John, the confusion stems from your question's contradiction; you first say "**16** alphanumeric characters consisting of the ASCII characters 0x00 to 0x7F" and then you declare "char trade_num_[**18**];" - I assume 18 is correct, which would add to 21 which matches your title. But please fix. — Ricket, Aug 05 '10 at 22:27
@John: Even \000 or \n or any of the non-printable ASCII values can be in _trade_num? So it is actually a 18-byte binary blob? — Nordic Mainframe, Aug 05 '10 at 22:31
@Alan et al: sorry for the confusion. The `trade_num_` field is indeed 18 bytes. The 16 was a typo. See my edit above. — John Dibling, Aug 09 '10 at 14:28
I apologize for the moving-target nature of this question. As I get closer to a solution I find more assumptions I have to make in order to get it to work. This is in part why I opened a bounty on this question -- to try to make it a little more worth everyone's while to keep up with the changes to the requirements. — John Dibling, Aug 09 '10 at 21:13
I'm also trying to keep up with edits to reflect changes in the requirements/assumptions. Sometimes I miss one, but when I become aware of it, I try to fix it right away. — John Dibling, Aug 09 '10 at 21:16

Mark Byers · Accepted Answer · 2010-08-09T20:56:34.020

34

If you have 18 characters in the range 0 - 127 and a number in the range 0 - 999 and compact this as much as possible then it will require 17 bytes.

>>> math.log(128**18 * 1000, 256)
16.995723035582763

You may be able to take advantage of the fact that some characters are most likely not used. In particular it is unlikely that there are any characters below value 32, and 127 is also probably not used. If you can find one more unused character so you can first convert the characters into base 94 and then pack them into the bytes as closely as possible.

>>> math.log(94**18 * 1000, 256)
15.993547951857446

This just fits into 16 bytes!

Example code

Here is some example code written in Python (but written in a very imperative style so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input. If there are you should substitute them with another character before encoding the string.

def encodeChar(c):
    return ord(c) - 32

def encode(s, n):
    t = 0
    for c in s:
        t = t * 94 + encodeChar(c)
    t = t * 1000 + n

    r = []
    for i in range(16):
        r.append(int(t % 256))
        t /= 256

    return r

print encode('                  ', 0)    # smallest possible value
print encode('abcdefghijklmnopqr', 123)
print encode('}}}}}}}}}}}}}}}}}}', 999)  # largest possible value

Output:

[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]
[ 59, 118, 192, 166, 108,  50, 131, 135, 174,  93,  87, 215, 177,  56, 170, 172]
[255, 255, 159, 243, 182, 100,  36, 102, 214, 109, 171,  77, 211, 183,   0, 247]

This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.

You will of course need an equivalent decoding function, the principle is the same - the operations are performed in reverse order.

edited Aug 09 '10 at 20:56

answered Aug 05 '10 at 22:16

Mark Byers

811,555
193
1,581
1,452

There's probably a decent chance that the space character (or at least one of the punctuation characters) might be found to be unused in the input set. – Michael Burr Aug 05 '10 at 22:49
@Mark, @Michael: Space and punctuation characters are frequently used in this kind of data. Moreover the spec says quite clearly that any ASCII character from 0x00 to 0x7F can be used in the trade number field, but in my experience the non-printable characters are not used. Trade IDs are generally human-readable, and my examination of this data feed supports the same to be true for this feed as well. So I think this solution will work. I'll post my solution when I have the code written. – John Dibling Aug 09 '10 at 14:32
@Mark: If I take out 127 and 126 (the tilde) this yields 94 characters in the input set. But 93 in binary is 1011101, which is still 7 bits. Am I missing something here? – John Dibling Aug 09 '10 at 16:07
2

John: What you're missing is base-94. Consider: a number from 0-199, base 10. The first digit has two options, so needs one bit. The next digits have 10 options each, so four bits. Total of 9 bits. Yet you know we can fit numbers from 0-199 in a single octet. The key is that *there are not a certain number of bits reserved for each digit, rather the entire sequence is converted to a large number which is then encoded in binary*. – Ben Voigt Aug 09 '10 at 20:12
@John Dibling: yes, you really have to squeeze your data to make it fit; just chopping the eight bit wont suffice. – liori Aug 09 '10 at 20:13
@Ben Voigt: Yes that is exactly what I had in mind. Though I'm not sure if C++ has good support for large integers so it might require using an external library, or else a clever algorithm to avoid ever having to store such large numbers. – Mark Byers Aug 09 '10 at 20:23
@Ben, @Mark: OK, I *think* I understand what you're saying, and if so I came to a similar conclusion independently. The problem I then ran in to was the encoding part. Take a much-simplified example. 3 characters, which can be A B or C. A has a value of 1, B=2, C=3. How do I encode this string? I can't just add the values together because they won't be unique, and if I concatenate the binary representation (eg 01, 10, 11), the aggregate will take up no less space than the individual fields. I suspect I'm just not getting something here. – John Dibling Aug 09 '10 at 20:36
@John Dibling: I've provided example source code in Python which I hope you can understand to get you started. If you want to convert this algorithm directly to C++ you would need a a big integer library for C++, and for that you might want to look at this thread: http://stackoverflow.com/questions/124332/c-handling-very-large-integers. – Mark Byers Aug 09 '10 at 20:39
@John Dibling: the only problem left is how sure you are that there will be no characters 0x00-0x1F and 0x7E-0x7F. If the standard says they can be there, there is a chance someone wanted to use that... then your code will be broken. You should at least have a check whether the characters are really in your allowed character set. – liori Aug 09 '10 at 21:00
@KennyTM: If you include tilde then it doesn't fit in 16 bytes unfortunately :( ... see my first paragraph for an explanation of why. – Mark Byers Aug 09 '10 at 21:02
@liori, @KennyTM, et al: It became apparent to me that inn order to make this work at all, I have to make certain assumptions about the data beyond what the spec says. Although the spec says "any character from 0x00 to 0x7F" in my experience these are human-readable strings. Also, the tilde character is not seen in practice. I will of course add code to verify that my assumptions are true at run-time (and if not handle it gracefully), but for now I am going on the assumption that the set of valid characters will be 0x20-0x7D inclusive. – John Dibling Aug 09 '10 at 21:09
@Mark re: edit. Ah, I was wondering about that 84. For a few minutes there I thought I was just incredibly stupid. :) – John Dibling Aug 09 '10 at 21:21
@Mark: Re: `for i in range(16):` does this execute 16 or 17 times? – John Dibling Aug 09 '10 at 21:29
@John Dibling: Have you found a big number library? Would you also like me to post a corresponding decoding function written in Python so that you can translate it to C++? – Mark Byers Aug 09 '10 at 21:59
@Mark: Thanks for asking. It will be difficult or impossible for me to introduce a bigint library to CVS for this, so I am trying to scavange what I need from http://sourceforge.net/projects/cpp-bigint/files/. No need to post the decoder. Your encoder was sufficient and I fully understand what I need to do. It will take me a day or so to get this written in the server & do some testing, but once I do I'll accept your answer. Thanks for your help. – John Dibling Aug 09 '10 at 22:23
I'm accepting this answer and awarding the bounty to it since the question in combination with the comments and Mark's patience in answering my questions was very helpful. – John Dibling Aug 12 '10 at 15:04

Nordic Mainframe · Answer 2 · 2010-08-05T22:42:16.670

5

That makes (18*7+10)=136 bits, or 17 bytes. You wrote trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd have only 6 bits per character, needing (18*6+10)=118 bit = 15 bytes for the whole thing.

Assuming 8 bit = 1 byte

Or, coming from another direction: You have 128 bits for storage, you need ~10 bits for the number part, so there are 118 left for the trade_num. 18 characters means 118/18=6.555 bits per characters, this means you can have only the space to encode 26.555 = 94 different characters **unless there is a hidden structure in trade_num that we could exploit to save more bits.

edited Aug 05 '10 at 22:42

answered Aug 05 '10 at 22:16

Nordic Mainframe

28,058
10
66
83

As I said in my OP, the spec defines 'alphanumeric' as being any of the ASCII characters from 0x00 to 0x7F. This does not correlate exactly to what a C++ programmer naturally considers to be 'alphanumeric' – John Dibling Aug 09 '10 at 14:33
If the values on trade_num are independent and evenly distributed, then yes. – Nordic Mainframe Aug 09 '10 at 20:28

score 2 · Answer 3 · answered Aug 09 '10 at 20:35

This is something that should work, assuming you need only characters from allowedchars, and there is at most 94 characters there. This is python, but it is written trying not to use fancy shortcuts--so that you'll be able to translate it to your destination language easier. It assumes however that the number variable may contain integers up to 2**128--in C++ you should use some kind of big number class.

allowedchars=' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}'
alphabase = len(allowedchars)

def compress(code):
    alphanumeric = code[0:18]
    number = int(code[18:21])

    for character in alphanumeric:
        # find returns index of character on the allowedchars list
        number = alphabase*number + allowedchars.find(character)

    compressed = ''
    for i in xrange(16):
        compressed += chr(number % 256)
        number = number/256

    return compressed

def decompress(compressed):
    number = 0

    for byte in reversed(compressed):
        number = 256*number + ord(byte)

    alphanumeric = ''
    for i in xrange(18):
        alphanumeric = allowedchars[number % alphabase] + alphanumeric
        number = number/alphabase

    # make a string padded with zeros
    number = '%03d' % number

    return alphanumeric + number

score 1 · Answer 4 · answered Aug 05 '10 at 22:22

You can do this in ~~15bytes (14 bytes and 6 bits).

For each character from trace_num_ you can save 1 bit if you want save ascii in 7 bits.

Then you have 2 bytes free and 2 bits, you must have 5.

Let get number information, each char can be one from ten values (0 to 9). Then you must have 4 bits to save this character, to save number you must have 1 byte and 4 bits, then you save half of this.

Now you have 3 bytes free and 6 bits, you must have 5.

If you want to use only qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[] You can save each char in 6 bits. Then you have next 2 bytes and 2 bits.

Now you have 6 bytes left, and your string can save in 15 bytes + nulltermination = 16bytes.

And if you save your number in integer on 10 bytes. You can fit this into 14 bytes and 6 bits.

It can be more than the set of characters you suggest. My OP was quite clear on this before my edits, but I have made it even more clear now. — John Dibling, Aug 09 '10 at 20:05

Jay · Answer 5 · 2010-08-10T21:20:38.823

Key questions are:

There appears to be some contradiction in your post whether the trade number is 16 or 18 characters. You need to clear that up. You say the total is 21 consisting of 16+3. :-(

You say the trade num characters are in the range 0x00-0x7f. Can they really be any character in that range, including tab, new line, control-C, etc? Or are they limited to printable characters, or maybe even to alphanumerics?

Does the output 16 bytes have to be printable characters, or is it basically a binary number?

EDIT, after updates to original post:

In that case, if the output can be any character in the character set, it's possible. If it can only be printable characters, it's not.

Demonstration of the mathematical possibility is straightforward enough. There are 94 possible values for each of 18 characters, and 10 possible values for each of 3. Total number of possible combinations = 94 ^ 18 * 10 ^ 3 ~= 3.28E35. This requires 128 bits. 2 ^127 ~= 1.70e38, which is too small, while 2^128 ~= 3.40e38, which is big enough. 128 bits is 16 bytes, so it will just barely fit if we can use every possible bit combination.

Given the tight fit, I think the most practical way to generate the value is to think of it as a double-long number, and then run the input through an algorithm to generate a unique integer for every possible input.

Conceptually, then, let's imagine we had a "huge integer" data type that is 16 bytes long. The algorithm would be something like this:

huge out;
for (int p=0;p<18;++p)
{
  out=out*94+tradenum[p]-32;
}
for (int p=0;p<3;++p)
{
  out=out*10+broker[p]-'0';
}

// Convert output to char[16]
unsigned char[16] out16;
for (int p=15;p>=0;--p)
{
  out16[p]=huge&0xff;
  huge=huge>>8;
}

return out16;

Of course we don't have a "huge" data type in C. Are you using pure C or C++? Isn't there some kind of big number class in C++? Sorry, I haven't done C++ in a while. If not, we could easily create a little library to implement a huge.

score 1 · Answer 6 · edited May 23 '17 at 12:19

There are 95 characters between the space (0x20) and tilde (0x7e). (The 94 in other answers suffer from off-by-1 error).

Hence the number of distinct IDs is 95¹⁸×1000 = 3.97×10³⁸.

But that compressed structure can only hold (2⁸)¹⁶ = 3.40×10³⁸ distinct values.

Therefore it is impossible to represent all IDs by that structure, unless:

There is 1 unused character in ≥15 digits of trade_num_, or
There are ≥14 unused characters in 1 digit of trade_num_, or
There are only ≤856 brokers, or
You're using is a PDP-10 which has a 9-bit char.

score 0 · Answer 7 · answered Aug 05 '10 at 22:17

0

If it can only contain letters, then you have less than 64 possibilities per character (26 upper case, 26 lower case, leaving you 12 for space, terminator, underscore, etc). With 6 bits per character, you should get there - in 15 characters. Assuming you don't support special characters.

answered Aug 05 '10 at 22:17

EboMike

76,846
14
164
167

It can contain more than letters. Please see my revisions. – John Dibling Aug 09 '10 at 20:04

score 0 · Answer 8 · answered Aug 05 '10 at 23:26

Use the first 10 bits for the 3-character numeric string (encode the bits like they represent a number and then pad with zeros as appropriate when decoding).

Okay, this leaves you with 118 bits and 16 alphanumeric characters to store.

0x00 to 0x7F (if you mean inclusive) comprises 128 possible characters to represent. That means that each character can be identified by a combination of 7 bits. Come up with an index mapping each number those 7 bits can represent to the actual character. To represent 16 of your "alphanumeric" characters in this way, you need a total of 112 bits.

We now have 122 bits (or 15.25 bytes) representing our data. Add an easter egg to fill in the remaining unused bits and you have your 16 character array.

OP edited. The trade number field is 18 bytes, not 16. Sorry for the typo. — John Dibling, Aug 09 '10 at 14:36

Compress 21 Alphanumeric Characters in to 16 Bytes

8 Answers8

Linked

Related