9

Is there a common method to encode and decode arbitrary data so the encoded end result consists of numbers only - like base64_encode but without the letters?

Fictitious example:

$encoded = numbers_encode("Mary had a little lamb");

echo $encoded; // outputs e.g. 12238433742239423742322 (fictitious result)

$decoded = numbers_decode("12238433742239423742322");

echo $decoded; // outputs "Mary had a little lamb"
Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • 2
    A string is just a set of numbers that map to human readable characters. Tell us a little more about why you want to do something like this and you might get a good answer. Do you want to be able to convert the number back to the original string? If not, a Hashing function would probably suffice. – William Leader Jun 05 '10 at 21:18
  • 1
    @William in my current case, I want to convert a 16-character URL identifier that consists of numbers and letters (internal ID, looks ugly) into a "numbers only" representation to make it easier on the eye, for use as anchors to access different content blocks in a CMS. – Pekka Jun 05 '10 at 21:20
  • 1
    @Pekka: Your fictitious result seems a bit optimistic, doesn't it? It's one character shorter than the original string! ;-) – Andy E Jun 05 '10 at 21:24
  • 1
    @Andy E good point! The same thought crossed my mind. It's fictitious as in, hacking randomly on the keyboard without regard to length :) I'll add a few characters for good measure. – Pekka Jun 05 '10 at 21:26
  • I was just jesting, but I'm sure those extra two characters will have made all the difference ;-D – Andy E Jun 05 '10 at 21:34
  • 3
    @Andy hey, I don't just want encoding, I want compression to go with it! That goes without saying. And in four lines of PHP please. – Pekka Jun 05 '10 at 21:38
  • 1
    @Pekka, Searching high and low for your base8 equivalent but not coming up with much past ord(). Even in some other threads I've seen that approach (http://stackoverflow.com/questions/2634427/code-golf-numeric-equivalent-of-an-excel-column-name/2653617#2653617). I'll keep looking but I suspect unless you write your own hashing function you're stuck with ord() or string-to-hex-to-dec. – allnightgrocery Jun 05 '10 at 21:41
  • 1
    Why is digits only 'neater' than hexadecimal, or base64? – Nick Johnson Jun 05 '10 at 22:43
  • I don't think it's realistic to expect an alphabet of 10 characters to be able to perform compression and encoding on a string with an alphabet of 36 characters without watching the string expand in size. – Joel Etherton Jun 06 '10 at 02:05
  • 2
    @Joel I was joking, having given the fictitious number too few digits initially. – Pekka Jun 06 '10 at 09:59

4 Answers4

14

You can think of a (single byte character) string as a base-256 encoded number where "\x00" represents 0, ' ' (space, i.e., "\x20") represents 32 and so on until "\xFF", which represents 255.

A representation only with numbers 0-9 can be accomplished simply by changing the representation to base 10.

Note that "base64 encoding" is not actually a base conversion. base64 breaks the input into groups of 3 bytes (24 bits) and does the base conversion on those groups individually. This works well because a number with 24 bits can be represented with four digits in base 64 (2^24 = 64^4).

This is more or less what el.pescado does – he splits the input data into 8-bit pieces and then converts the number into base 10. However, this technique has one disadvantage relatively to base 64 encoding – it does not align correctly with the byte boundary. To represent a number with 8-bits (0-255 when unsigned) we need three digits in base 10. However, the left-most digit has less information than the others. It can either be 0, 1 or 2 (for unsigned numbers).

A digit in base 10 stores log(10)/log(2) bits. No matter the chunk size you choose, you're never going to be able to align the representations with 8-bit bytes (in the sense of "aligning" I've described in the paragraph before). Consequently, the most compact representation is a base conversion (which you can see as if it were a "base encoding" with only one big chunk).

Here is an example with bcmath.

bcscale(0);
function base256ToBase10(string $string) {
    //argument is little-endian
    $result = "0";
    for ($i = strlen($string)-1; $i >= 0; $i--) {
        $result = bcadd($result,
            bcmul(ord($string[$i]), bcpow(256, $i)));
    }
    return $result;
}
function base10ToBase256(string $number) {
    $result = "";
    $n = $number;
    do {
        $remainder = bcmod($n, 256);
        $n = bcdiv($n, 256);
        $result .= chr($remainder);
    } while ($n > 0);

    return $result;
}

For

$string = "Mary had a little lamb";
$base10 = base256ToBase10($string);
echo $base10,"\n";
$base256 = base10ToBase256($base10);
echo $base256;

we get

36826012939234118013885831603834892771924668323094861
Mary had a little lamb

Since each digit encodes only log(10)/log(2)=~3.32193 bits expect the number to tend to be 140% longer (not 200% longer, as would be with el.pescado's answer).

Community
  • 1
  • 1
Artefacto
  • 96,375
  • 17
  • 202
  • 225
7

Well, that would be "base 8" encoding rather than Base 64. This is better know as Octal.

All Base64 does is convert bit streams in to 6 bit blocks (0-63), and assigns a character from a 64 character character set. Octal uses 3 bits, 0-7. So it COULD use ABCDEFGH, but instead uses 0-7. You can't (easily) use 0-9 because 0-9 is up to 4 bits, but not completely 4 bits. That's what makes it a lousy encoding for binary data.

Will Hartung
  • 115,893
  • 19
  • 128
  • 203
  • I see, cheers for the background. I need this to build URLs from ugly-looking (but only 16 character) identifiers so the efficiency aspect isn't important. There's an implementation in the user contributed notes: http://de.php.net/manual/en/function.base64-encode.php#78765 I'll try to get that to work in base 8. – Pekka Jun 05 '10 at 21:16
  • 1
    It doesn't have to be base 8 - it could equally be base 10. – Nick Johnson Jun 05 '10 at 22:43
3

Regardless of how you encode you'll always end back up at a smaller base. It may be possible to shrink the resultant integer a bit smaller with some dechex() conversions but ultimately you'll only save a few characters. That being said, the number really balloons the moment you start representing multi-byte characters with 0-9.

I have to wonder if integers as IDs, representing words, or complete strings, wouldn't provide a smaller footprint. Not really a direct encoding but a viable option.

@el.pescado gets credit for the first half but he did challenge the reader. So, I responded (mainly because I wanted to understand what's happening).

function pekka_encode($s) {
    $out = '';
    for ($i=0;$i<strlen($s); $i++) {
        $out .= sprintf("%03d", ord($s[$i]));     
    }
    return $out;
}

function pekka_decode($s) {
    $out = '';
    for ($i=0;$i<strlen($s);$i+=3) {
        $out .= chr($s[$i].$s[$i+1].$s[$i+2]);
    }
    return $out;
}
allnightgrocery
  • 1,370
  • 1
  • 8
  • 15
2

Very simple example - it represents every input byte as 3-digit decimal number:

function data2numbers ($data) {
    $out = "";
    for ($i = 0; $i < strlen ($data); $i++) {
        $out .= sprintf ("%03d", ord ($data[$i]));
    }
    return $out;
}

Downside is that it triples size of any input data (every input byte is represented as three output bytes).

Decoding function is left as an exercise to the reader;)

  • Clever! I had thought about that. It *will* take up a lot more space than necessary, but it will do for my purposes. I will wait though and see whether somebody comes up with a real "base8" implementation in the spirit of the question :) – Pekka Jun 05 '10 at 21:24