0

This is a bit of a side project I have taken on to solve a no-fix issue for work. Our system outputs a code to represent a combination of things on another thing. Some example codes are:

9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4

9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72

9-15-0-9-1-6-2-1-2-0-0-1-6-0-7

The max number in one of the slots I've seen so far is about 150 but they will likely go higher.

When the system was designed there was no requirement for what this code would look like. But now the client wants to be able to type it in by hand from a sheet of paper, something the code above isn't suited for. We've said we won't do anything about it, but it seems like a fun challenge to take on.

My question is where is a good place to start loss-less compressing this code? Obvious solutions such as store this code with a shorter key are not an option; our database is read only. I need to build a two way method to make this code more human friendly.

Community
  • 1
  • 1
solipsicle
  • 864
  • 1
  • 9
  • 19
  • "Obvious solutions such as store this code with a shorter key are not an option; our database is read only." - gotta love questions with artificial constraints! Perhaps a surrogate key is in order... – Mitch Wheat Sep 03 '11 at 23:44
  • Do I understand that you simply want to make it easier to key in, not reduce the size of the data stored? – ExCodeCowboy Sep 03 '11 at 23:45
  • @Xenophile yes, ideally it would be a minimal number of numbers and letters. Something more like the end of a bit.ly link. – solipsicle Sep 03 '11 at 23:50
  • Humans are very poor at entering long lists of integers; apart from using an encoding (which is what you are really talking about rather than compression; though compresion is a side effect) like base 64. A better solution is not to have than enter anything directly, such as using a barcode. – Mitch Wheat Sep 04 '11 at 00:10
  • Based on the question title only: Use *Comic Sans*? :-P – j_random_hacker Sep 04 '11 at 04:45
  • @Mitch A 3D barcode? I'd love to see one of those. What do you scan it with? – Nick Johnson Sep 05 '11 at 00:55
  • you've almost certainly already seen one: http://en.wikipedia.org/wiki/Mobile_tagging , the name is slightly misleading... – Mitch Wheat Sep 05 '11 at 02:01

3 Answers3

1

1) I agree that you definately need a checksum - data entry errors are very common, unless you have really well trained staff and independent duplicate keying with automatic crosss-checking.

2) I suggest http://en.wikipedia.org/wiki/Huffman_coding to turn your list of numbers into a stream of bits. To get the probabilities required for this, you need a decent sized sample of real data, so you can make a count, setting Ni to the number of times number i appears in the data. Then I suggest setting Pi = (Ni + 1) / (Sum_i (Ni + 1)) - which smooths the probabilities a bit. Also, with this method, if you see e.g. numbers 0-150 you could add a bit of slack by entering numbers 151-255 and setting them to Ni = 0. Another way round rare large numbers would be to add some sort of escape sequence.

3) Finding a way for people to type the resulting sequence of bits is really an applied psychology problem but here are some suggestions of ideas to pinch.

3a) Software licences - just encode six bits per character in some 64-character alphabet, but group characters in a way that makes it easier for people to keep place e.g. BC017-06777-14871-160C4

3b) UK car license plates. Use a change of alphabet to show people how to group characters e.g. ABCD0123EFGH4567IJKL...

3c) A really large alphabet - get yourself a list of 2^n words for some decent sized n and encode n bits as a word e.g. GREEN ENCHANTED LOGICIAN... -

mcdowella
  • 19,301
  • 2
  • 19
  • 25
  • +1 for 3c: Such as, for example, encoding 8 bits per word using the PGP [biometric word list](http://en.wikipedia.org/wiki/biometric_word_list). – David Cary Aug 28 '12 at 17:21
0

i worried about this problem a while back. it turns out that you can't do much better than base64 - trying to squeeze a few more bits per character isn't really worth the effort (once you get into "strange" numbers of bits encoding and decoding becomes more complex). but at the same time, you end up with something that's likely to have errors when entered (confusing a 0 with an O etc). one option is to choose a modified set of characters and letters (so it's still base 64, but, say, you substitute ">" for "0". another is to add a checksum. again, for simplicity of implementation, i felt the checksum approach was better.

unfortunately i never got any further - things changed direction - so i can't offer code or a particular checksum choice.

ps i realised there's a missing step i didn't explain: i was going to compress the text into some binary form before encoding (using some standard compression algorithm). so to summarize: compress, add checksum, base64 encode; base 64 decode, check checksum, decompress.

andrew cooke
  • 45,717
  • 10
  • 93
  • 143
0

This is similar to what I have used in the past. There are certainly better ways of doing this, but I used this method because it was easy to mirror in Transact-SQL which was a requirement at the time. You could certainly modify this to incorporate Huffman encoding if the distribution of your id's is non-random, but it's probably unnecessary.

You didn't specify language, so this is in c#, but it should be very easy to transition to any language. In the lookup you'll see commonly confused characters are omitted. This should speed up entry. I also had the requirement to have a fixed length, but it would be easy for you to modify this.

static public class CodeGenerator
{
    static Dictionary<int, char> _lookupTable = new Dictionary<int, char>();

    static CodeGenerator()
    {
        PrepLookupTable();
    }

    private static void PrepLookupTable()
    {
        _lookupTable.Add(0,'3');
        _lookupTable.Add(1,'2');
        _lookupTable.Add(2,'5');
        _lookupTable.Add(3,'4');
        _lookupTable.Add(4,'7');
        _lookupTable.Add(5,'6');
        _lookupTable.Add(6,'9');
        _lookupTable.Add(7,'8');
        _lookupTable.Add(8,'W');
        _lookupTable.Add(9,'Q');
        _lookupTable.Add(10,'E');
        _lookupTable.Add(11,'T');
        _lookupTable.Add(12,'R');
        _lookupTable.Add(13,'Y');
        _lookupTable.Add(14,'U');
        _lookupTable.Add(15,'A');
        _lookupTable.Add(16,'P');
        _lookupTable.Add(17,'D');
        _lookupTable.Add(18,'S');
        _lookupTable.Add(19,'G');
        _lookupTable.Add(20,'F');
        _lookupTable.Add(21,'J');
        _lookupTable.Add(22,'H');
        _lookupTable.Add(23,'K');
        _lookupTable.Add(24,'L');
        _lookupTable.Add(25,'Z');
        _lookupTable.Add(26,'X');
        _lookupTable.Add(27,'V');
        _lookupTable.Add(28,'C');
        _lookupTable.Add(29,'N');
        _lookupTable.Add(30,'B');          
    }


    public static bool TryPCodeDecrypt(string iPCode, out Int64 oDecryptedInt)
    {
        //Prep the result so we can exit without having to fiddle with it if we hit an error.
        oDecryptedInt = 0;

        if (iPCode.Length > 3)
        {
            Char[] Bits = iPCode.ToCharArray(0,iPCode.Length-2);

            int CheckInt7 = 0; 
            int CheckInt3 = 0;
            if (!int.TryParse(iPCode[iPCode.Length-1].ToString(),out CheckInt7) ||
                !int.TryParse(iPCode[iPCode.Length-2].ToString(),out CheckInt3))
            {
                //Unsuccessful -- the last check ints are not integers.
                return false;
            }
            //Adjust the CheckInts to the right values.
            CheckInt3 -= 2;
            CheckInt7 -= 2;

            int COffset = iPCode.LastIndexOf('M')+1;


            Int64 tempResult = 0;
            int cBPos = 0;
            while ((cBPos + COffset) < Bits.Length)
            {
                //Calculate the current position.
                int cNum = 0;
                foreach (int cKey in _lookupTable.Keys)
                {
                    if (_lookupTable[cKey] == Bits[cBPos + COffset])
                    {
                        cNum = cKey;
                    }
                }
                tempResult += cNum * (Int64)Math.Pow((double)31, (double)(Bits.Length - (cBPos + COffset + 1)));
                cBPos += 1;
            }

            if (tempResult % 7 == CheckInt7 && tempResult % 3 == CheckInt3)
            {
                 oDecryptedInt =  tempResult;
                return true;    
            }


            return false;

        }
        else
        {
            //Unsuccessful -- too short.
            return false;
        }
    }
    public static string PCodeEncrypt(int iIntToEncrypt, int iMinLength)
    {
        int Check7 = (iIntToEncrypt % 7) + 2;
        int Check3 = (iIntToEncrypt % 3) + 2;

        StringBuilder result = new StringBuilder();
        result.Insert(0, Check7);
        result.Insert(0, Check3);

        int workingNum = iIntToEncrypt;

        while (workingNum > 0)
        {
            result.Insert(0, _lookupTable[workingNum % 31]);
            workingNum /= 31;
        }

        if (result.Length < iMinLength)
        {
            for (int i = result.Length + 1; i <= iMinLength; i++)
            {
                result.Insert(0, 'M');
            }
        }

        return result.ToString();
    }

}
ExCodeCowboy
  • 870
  • 7
  • 10