Can zlib compressed output avoid using certain byte value?

Question

It seems that the output of zlib.compress uses all possible byte values. Is this possible to use 255 of 256 byte values (for example avoid using \n)?

Note that I just use the python manual as a reference, but the question is not specific to python (i.e. any other languages that has a zlib library).

Zlib complression doesn’t use all possible ‘characters’ it uses all possible 8-bit byte values, i.e. 0-255. It should be technically possible to implement your own similar compression scheme that does avoid a particular value, but it woudn’t be interchangeable with standard zlib complression. — DisappointedByUnaccountableMod, Jun 25 '20 at 22:39
Python’s zip library is implemented in Python - the source is there, you can create your own ‘user1424739lib compression. — DisappointedByUnaccountableMod, Jun 25 '20 at 22:41
You could use some sort of escape sequence to replace any newlines in the compressed data - for example, replace newlines with `X1`, replace actual `X`s with `X2`, reverse those replacements at the receiving end before decompressing. (This is the same basic idea as how programming languages let you include quote marks in a quoted string literal, by preceding them with a backslash.) This unavoidably cancels out part of your compression - by a factor of 1/128 on average, up to a factor of 2 if the compressed data happens to consist entirely of bytes needing escaping (but that's very unlikely). — jasonharper, Jun 25 '20 at 22:57
I think you could convert the output to be between 0-254, but not (easily) be able to skip a specific value in the range of 0-255. When that be acceptable? — martineau, Jun 25 '20 at 23:00
@martineau OK. This seems to be a reasonable workaround. Could you provide a python implementation to convert the result of zlib.compress and convert it back? — user1424739, Jun 25 '20 at 23:01
Good news, I was able to do it, and without needing to restrict the values to be 0-254 — see answer I posted. — martineau, Jun 27 '20 at 02:06

score 1 · Answer 1 · answered Jun 25 '20 at 22:32

1

No, this is not possible. Apart from the compressed data itself, there is standardized control structures which contain integers. Those integers may accidentially lead to any 8-bit character ending up in the bytestream.

Your only chance would be to encode the zlib bytestream into another format, e.g. base64.

answered Jun 25 '20 at 22:32

ypnos

50,202
14
95
141

1

base64 inviolates the original purpose of compress the stream. Is there a way to convert a 256 possible byte stream to 255 possible type stream so that I reserve '\n' for my own purpose? – user1424739 Jun 25 '20 at 22:37

Mark Adler · Answer 2 · 2020-06-27T02:59:56.740

The whole point of compression is to reduce the size as much as possible. If zlib or any compressor only used 255 of the 256 byte values, the size of the output would be increased by at least 0.07%.

That may be perfectly fine for you, so you can simply post-process the compressed output, or any data at all, to remove one particular byte value at the expense of some expansion. The simplest approach would be to replace that byte when it occurs with a two-byte escape sequence. You would also then need to replace the escape prefix with a different two-byte escape sequence. That would expand the data on average by 0.8%. That is exactly what Hans provided in another answer here.

If that cost is too high, you can do something more sophisticated, which is to decode a fixed Huffman code that encodes 255 symbols of equal probability. To decode you then encode that Huffman code. The input is a sequence of bits, not bytes, and most of the time you will need to pad the input with some zero bits to encode the last symbol. The Huffman code turns one symbol into seven bits and the other 254 symbols into eight bits. So going the other way, it will expand the input by a little less than 0.1%. For short messages it will be a little more, since often less than seven bits at the very end will be encoded into a symbol.

Implementation in C:

// Placed in the public domain by Mark Adler, 26 June 2020.

// Encode an arbitrary stream of bytes into a stream of symbols limited to 255
// values. In particular, avoid the \n (10) byte value. With -d, decode back to
// the original byte stream. Take input from stdin, and write output to stdout.

#include <stdio.h>
#include <string.h>

// Encode arbitrary bytes to a sequence of 255 symbols, which are written out
// as bytes that exclude the value '\n' (10). This encoding is actually a
// decoding of a fixed Huffman code of 255 symbols of equal probability. The
// output will be on average a little less than 0.1% larger than the input,
// plus one byte, assuming random input. This is intended to be used on
// compressed data, which will appear random. An input of all zero bits will
// have the maximum possible expansion, which is 14.3%, plus one byte.
int nolf_encode(FILE *in, FILE *out) {
    unsigned buf = 0;
    int bits = 0, ch;
    do {
        if (bits < 8) {
            ch = getc(in);
            if (ch != EOF) {
                buf |= (unsigned)ch << bits;
                bits += 8;
            }
            else if (bits == 0)
                break;
        }
        if ((buf & 0x7f) == 0) {
            buf >>= 7;
            bits -= 7;
            putc(0, out);
            continue;
        }
        int sym = buf & 0xff;
        buf >>= 8;
        bits -= 8;
        if (sym >= '\n' && sym < 128)
            sym++;
        putc(sym, out);
    } while (ch != EOF);
    return 0;
}

// Decode a sequence of symbols from a set of 255 that was encoded by
// nolf_encode(). The input is read as bytes that exclude the value '\n' (10).
// Any such values in the input are ignored and flagged in an error message.
// The sequence is decoded to the original sequence of arbitrary bytes. The
// decoding is actually an encoding of a fixed Huffman code of 255 symbols of
// equal probability.
int nolf_decode(FILE *in, FILE *out) {
    unsigned long lfs = 0;
    unsigned buf = 0;
    int bits = 0, ch;
    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            lfs++;
            continue;
        }
        if (ch == 0) {
            if (bits == 0) {
                bits = 7;
                continue;
            }
            bits--;
        }
        else {
            if (ch > '\n' && ch <= 128)
                ch--;
            buf |= (unsigned)ch << bits;
        }
        putc(buf, out);
        buf >>= 8;
    }
    if (lfs)
        fprintf(stderr, "nolf: %lu unexpected line feeds ignored\n", lfs);
    return lfs != 0;
}

// Encode (no arguments) or decode (-d) from stdin to stdout.
int main(int argc, char **argv) {
    if (argc == 1)
        return nolf_encode(stdin, stdout);
    else if (argc == 2 && strcmp(argv[1], "-d") == 0)
        return nolf_decode(stdin, stdout);
    fputs("nolf: unknown options (use -d to decode)\n", stderr);
    return 1;
}

Hans Musgrave · Answer 3 · 2020-06-25T23:50:40.433

0

As @ypnos says, this isn't possible within zlib itself. You mentioned that base64 encoding is too inefficient, but it's pretty easy to use an escape character to encode a character you want to avoid (like newlines).

This isn't the most efficient code in the world (and you might want to do something like finding the least used bytes to save a tiny bit more space), but it's readable enough and demonstrates the idea. You can losslessly encode/decode, and the encoded stream won't have any newlines.

def encode(data):
    # order matters
    return data.replace(b'a', b'aa').replace(b'\n', b'ab')

def decode(data):
    def _foo():
        pair = False
        for b in data:
            if pair:
                # yield b'a' if b==b'a' else b'\n'
                yield 97 if b==97 else 10
                pair = False
            elif b==97:  # b'a'
                pair = True
            else:
                yield b
    return bytes(_foo())

As some measure of confidence you can check this exhaustively on small bytestrings:

from itertools import *

all(
    bytes(p) == decode(encode(bytes(p)))
        for c in combinations_with_replacement(b'ab\nc', r=6)
        for p in permutations(c)
)

edited Jun 25 '20 at 23:50

answered Jun 25 '20 at 23:28

Hans Musgrave

6,613
1
18
37

Does `decode()` work correctly? What if the original input contains `ab`? – user1424739 Jun 25 '20 at 23:35
Yes. You get `aab` which is decoded back to `ab`. The code could return an error if an `a` is followed by anything other than an `a` or `b`, but it is liberal instead and returns `\n` for anything other than an `a` following an `a`. – Mark Adler Jun 26 '20 at 15:33
@MarkAdler The decode() function did have a bug that I fixed in an edit (and `ab` was a case where it would fail). That's a good point though that decode() will try to give _some_ answer even for invalid inputs. – Hans Musgrave Jun 26 '20 at 15:47
Ah, ok. I didn't notice that the comment preceded an edit. – Mark Adler Jun 26 '20 at 15:49

Can zlib compressed output avoid using certain byte value?

3 Answers3