2

Base64 encodes three 8-bit characters onto onto four (base-64) 6-bit "characters". Base64 is efficient because it uses a base (64) and exponent (4) that happens to perfectly match a base-10 exponent of 2 (24): 3x8=4x6=24 and 224=644=16777216.

It appears that there are no base/exponent combinations that result in values that exactly match base-10 exponents of 2 (specifically 2n for any 0<n<256), except for base32, base64 and base128 (along with base4, base8, base16, base256, base512, etc which are more difficult to make practical use of). See the last code block for the full list of matching exponents!

To use the example of base92, 922 = 8464, with the closest base-10 exponent of 2 being 213 = 8192. 8464-8192 = 272 indices available on the base92 side that (if I understand correctly) are not possible to take advantage of. (219=524288 < 913=753571 < 220=1048576, but 220-913=295005. 272 is the clear winner.)

While pondering these lossiness problems earlier this afternoon, I arrived at an interesting question. If I look at input 14 bits at a time, and I ever come across an input sequence like 00100000 11111111, I could interpret that as 10000011111111, or 8447, and encode it within the area 8193<n<8464 that is otherwise inaccessible! However, once I reach 10000100010001, 8465, I would need to drop back to using 7-bit encoding. This makes this approach input-sensitive, and in practice would only be useful for two-byte sequences with a starting byte of 10000100 (132, ASCII 'Z'). I'd rather my encoder not be input-sensitive.

I have two questions:

  1. Does my understanding of base-n encoding seem to be generally correct? Besides the Wikipedia page, where else can I go to learn about it?

  2. Are there any tricks or techniques I can use to raise the efficiency of base-n compression for an arbitrary base other than base32/64/128?

 


References

Everything below this point is not required reading to answer my question.

This small PHP script computes every possible base^exponent (for 0<base<513 and 2<exponent<257, both picked arbitrarily) alongside every possible exponent of 2n (for 0<n<256). It then subtracts the results of the two calculations, and lists the results in ascending order.

It was written for Linux but should run fine on Windows if you comment out the fprintf() calls (which contain some escape sequences).

Note that this program will generate 30,145 lines of output :)

<?php

$c = 0;
$op = 0;

for ($i = 3; $i < 513; $i++) {
    for ($j = 2; $j < 257; $j++) {
        $p = ($c * 100) / 33835;
        if ($p > $op + 5 || $p == 100) {
            fprintf(STDERR, "\rgen:  %2.1f%%", $p);
            $op = $p;
        }
        if ($i >= 2 ** 32 || $j >= 2 ** 32 || $i ** $j >= 2 ** 32) continue;
        for ($k = 1; $k < 33; $k++) {
            $x = $i ** $j - 2 ** $k;
            if ($x < 0) continue;
            if (!isset($a[$x])) $a[$x] = [];
            $a[$x][] = [ $i, $j, $k, $i ** $j, 2 ** $k, (float)sprintf("%2.1f", ((2 ** $k) / ($i ** $j)) * 100) ];
            $c++;
        }
    }
}

foreach ($a as $i => $_) {
    if ($i < 0) {
        print "\r64-bit machine required\n";
        die;
    }
}

$q = $op = $c = 0;
foreach ($a as $i => $_) {
    $p = ($c * 100) / 28013;
    if ($p > $op + 5 || $p == 100) {
        fprintf(STDERR, "\rsort: %2.1f%%", $p);
        $op = $p;
    }
    $c++;
    uasort($a[$i], function($a, $b) {
        return $a[0] < $b[0];
    });
}

fprintf(STDERR, "\rsort root\e[K");
uksort($a, function($a, $b) {
    return $a > $b;
});

fprintf(STDERR, "\r\e[K");
$l = 0;
foreach ($a as $i => $z) {
    foreach ($z as $x) {
        print
            sprintf("%5d %5.1f%%", $l++, $x[5])
            .' ('.$x[0].'^'.$x[1].'='.sprintf("%.0f", $x[0] ** $x[1]).')'
            .'-(2^'.$x[2].'='.(2 ** $x[2]).')'
            .'='.$i
            ."\n";

    }
}

As noted at the end of the second paragraph at the start of this question, here is the full list of exponent combinations that perfectly match 2n for 0<n<256.

Format is: line number; (2^n) / (base^exponent) represented as a percentage; (base^exponent=result)-(2^exponent)=distance.

Note base64 on line 9. Base92 is on line 200 of the full output of the program above.

    0 100.0% (512^3=134217728)-(2^27=134217728)=0
    1 100.0% (512^2=262144)-(2^18=262144)=0
    2 100.0% (256^3=16777216)-(2^24=16777216)=0
    3 100.0% (256^2=65536)-(2^16=65536)=0
    4 100.0% (128^4=268435456)-(2^28=268435456)=0
    5 100.0% (128^3=2097152)-(2^21=2097152)=0
    6 100.0% (128^2=16384)-(2^14=16384)=0
    7 100.0% (64^3=262144)-(2^18=262144)=0
    8 100.0% (64^2=4096)-(2^12=4096)=0
    9 100.0% (64^4=16777216)-(2^24=16777216)=0
   10 100.0% (64^5=1073741824)-(2^30=1073741824)=0
   11 100.0% (32^6=1073741824)-(2^30=1073741824)=0
   12 100.0% (32^5=33554432)-(2^25=33554432)=0
   13 100.0% (32^4=1048576)-(2^20=1048576)=0
   14 100.0% (32^3=32768)-(2^15=32768)=0
   15 100.0% (32^2=1024)-(2^10=1024)=0
   16 100.0% (16^2=256)-(2^8=256)=0
   17 100.0% (16^7=268435456)-(2^28=268435456)=0
   18 100.0% (16^6=16777216)-(2^24=16777216)=0
   19 100.0% (16^5=1048576)-(2^20=1048576)=0
   20 100.0% (16^4=65536)-(2^16=65536)=0
   21 100.0% (16^3=4096)-(2^12=4096)=0
   22 100.0% (8^10=1073741824)-(2^30=1073741824)=0
   23 100.0% (8^8=16777216)-(2^24=16777216)=0
   24 100.0% (8^7=2097152)-(2^21=2097152)=0
   25 100.0% (8^6=262144)-(2^18=262144)=0
   26 100.0% (8^5=32768)-(2^15=32768)=0
   27 100.0% (8^4=4096)-(2^12=4096)=0
   28 100.0% (8^3=512)-(2^9=512)=0
   29 100.0% (8^2=64)-(2^6=64)=0
   30 100.0% (8^9=134217728)-(2^27=134217728)=0
   31 100.0% (4^3=64)-(2^6=64)=0
   32 100.0% (4^8=65536)-(2^16=65536)=0
   33 100.0% (4^4=256)-(2^8=256)=0
   34 100.0% (4^5=1024)-(2^10=1024)=0
   35 100.0% (4^6=4096)-(2^12=4096)=0
   36 100.0% (4^7=16384)-(2^14=16384)=0
   37 100.0% (4^13=67108864)-(2^26=67108864)=0
   38 100.0% (4^9=262144)-(2^18=262144)=0
   39 100.0% (4^10=1048576)-(2^20=1048576)=0
   40 100.0% (4^11=4194304)-(2^22=4194304)=0
   41 100.0% (4^12=16777216)-(2^24=16777216)=0
   42 100.0% (4^14=268435456)-(2^28=268435456)=0
   43 100.0% (4^15=1073741824)-(2^30=1073741824)=0
   44 100.0% (4^2=16)-(2^4=16)=0
i336_
  • 1,813
  • 1
  • 20
  • 41
  • 5
    Wait, are you asking if there is any `k` such that `k` is not a power of two but `k^m` is a power of 2, for some `m`? There cannot possibly be such a `k`. Think about prime factorisation. – AakashM Apr 04 '17 at 13:40
  • @AakashM: "*... There cannot possibly be such a `k`.*" Thanks for this - yes, this is effectively what I was wondering (but I'm also wondering if there are any tricks/workarounds I can do to add extra efficiency). My mathematical foundation is not as substantive as I wish it were. TIL about prime factorisation! – i336_ Apr 04 '17 at 14:47

1 Answers1

0

base128 is not effective because you must use characters witch codes greater than '128'. For charater witch codes >=128 chrome send two bytes... (so string witch 1MB of this characters on sending will be change to 2MB bytes... so you loose all profit). For base64 strings this phenomena does't appear (so we loose only ~33%). More details here in "update" section.

Kamil Kiełczewski
  • 85,173
  • 29
  • 368
  • 345