0

I'm looking for a way to convert an alphanumeric string, e.g. "aBcd3f", into a purely numeric representation, and get the shortest possible input string. The valid characters in the input string are a-z, A-Z, 0-9, and the resultant string would be comprised only of digits 0-9.

Since there are 62 valid values for each character in the input string, I can assign values 00-61 to each input character, and covert the 6 input characters into a 12 character numeric string.

But I would like to get something more compact, if possible - e.g. 8-10 digits. Is it possible, and if so, are there any algorithms or functions for doing this in PHP?

Note that this has to be a 2-way function. I also need to be able to go back from the numeric string to the alphanumeric.

I haven't found this question asked on this site. My question is the opposite of this question, as I'm trying to go in the opposite direction.

Community
  • 1
  • 1
janman05
  • 57
  • 1
  • 7

1 Answers1

2

A decimal digit encodes log2(10) = 3.32 bits of information on average. Alphanumeric data has 62 possible "digits", so each one encodes log2(62) = 5.95 bits of information on average.

This means that converting from alphanumeric to decimal digits only will require approximately 5.95 / 3.32 = 1.79 times more characters in the output than there are in the input. If your output is constrained to 10 characters maximum you can expect it to encode at most 5.58 characters of alphanumeric input, which for practical purposes means just 5. There is no room for maneuvering here; this is cold math.

The manner of converting from one representation to the other is fairly straightforward, because in essence you are simply converting a number from base 62 to base 10 and back. You can tweak the code from this answer of mine only slightly to achieve the aim.

See it in action.

Note that with the (arbitrary) order of digits I picked the "largest" possible input with 5 characters is "ZZZZZ", which encodes to 9 decimal digits. If you expand the input to 6 characters the largest input would be "ZZZZZZ" which would need 11 decimal digits to encode -- more than the limit we imposed, as predicted.

Also note that this analysis assumes every possible input string is as likely to occur as any other, i.e. the input is perfectly random. If this is not the case then the actual information content of the input would be lower than the theoretical maximum and consequently you could take advantage of this with some kind of compression scheme.

Community
  • 1
  • 1
Jon
  • 428,835
  • 81
  • 738
  • 806