
If I were to AES-encrypt a file, and then ZLIB-compress it, would the compression be less efficient than if I first compressed and then encrypted?

In other words, should I compress first or encrypt first, or does it matter?

Cheeso
  • They are not the same question at all. This question asks about efficiency, the other is about security. – Ferruccio Jun 28 '16 at 18:02
  • I feel like this question was never fully answered, as the answers all seem to discuss efficiency from the standpoint of "compressed data size" (or compression ratio, or whatever you want to call it). Another aspect to consider is the total CPU time needed to process the data, and by this measure for a compressible (i.e. text, not binary) payload of nontrivial size (i.e. anything over a few kB) it's more computationally efficient to compress and then encrypt (even versus just encrypting the uncompressed data and performing no compression at all). – aroth Jun 25 '20 at 07:46

6 Answers


Compress first. Once you encrypt the file you will generate a stream of seemingly random data, which will not be compressible. The compression process depends on finding compressible patterns in the data.
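
To see this concretely, here is a minimal Python sketch (an editorial illustration, not from the original answer), in which `os.urandom` stands in for AES output, since good ciphertext should be indistinguishable from random bytes:

```python
import os
import zlib

# Highly repetitive plaintext compresses extremely well.
plaintext = b"the quick brown fox jumps over the lazy dog " * 1000

# os.urandom stands in for ciphertext: to a compressor, well-encrypted
# data should look just like uniformly random bytes.
pseudo_ciphertext = os.urandom(len(plaintext))

print(len(plaintext), len(zlib.compress(plaintext)))                  # shrinks dramatically
print(len(pseudo_ciphertext), len(zlib.compress(pseudo_ciphertext))) # slightly *grows*
```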

Ferruccio
  • It's not really random. It's just that no compression algorithm will be able to spot the pattern anymore after it's encrypted. – finnw Jan 13 '11 at 02:04
  • True enough. It looks random. The process is deterministic, so given the same data and key you will get the same random-looking result. – Ferruccio Jan 13 '11 at 11:46
  • @finnw Supposing the encryption algorithm takes steps to remove patterns (such as using a block cipher in CBC mode with a random IV), encrypted data is indistinguishable from random data. – yfeldblum Jan 13 '11 at 14:06
  • @Ferruccio If you use, for example, a block cipher in CBC mode with a random IV, then, given the same data and the same key, you will get a different random-looking result. – yfeldblum Jan 13 '11 at 14:07
  • @Justice, I can distinguish it from random data (as long as you give me the key.) – finnw Jan 13 '11 at 14:10
  • @Justice - if you give it a random initial value, do you then need that same initial value to decrypt it? If that's the case, then isn't that initial value effectively part of the key? – Ferruccio Jan 13 '11 at 14:59
  • @finnw Absolutely correct. (I just haven't given you the key.) – yfeldblum Jan 13 '11 at 16:38
  • @Ferruccio Yes, you need the same IV to decrypt the ciphertext. But no, it's not part of the key. You can prepend the IV to the ciphertext and store the concatenated result (the corresponding step for decryption is obvious). If you do this, you must simply generate a new IV for each new message stored, and the IV must be generated by a cryptographically strong PRNG. The IV is not a *secret* - it just has to be *random* and then you're golden: encrypted data is indistinguishable from random data. (See the sketch after these comments.) – yfeldblum Jan 13 '11 at 16:42
  • @Justice - I see. So my original statement should have been "The process is deterministic, so given the same data, key and initial value (if you're using cipher block chaining) you will get the same random looking result." – Ferruccio Jan 13 '11 at 19:38
  • @Ferruccio Correct. Note that if you use the same IV for multiple messages that are not chained to each other then the IV *is* a secret. And in this case, given the same data, key and IV, then the ciphertexts of two identical messages will themselves be identical. – yfeldblum Jan 13 '11 at 20:13
  • @yfeldblum *"encrypted data is indistinguishable from random data."* That's only if the encryption algorithm isn't broken. – NullUserException Jan 26 '13 at 16:31
  • @NullUserException: Indeed. So, my statements presuppose, for example, that you're not using either the rot-k or the des ciphers. – yfeldblum Jan 26 '13 at 23:18
  • Does any practical (finite-time) compression algorithm fully randomize the message? Isn't there always a residual pattern to the compressed data that can then be used to break the encryption (theoretically)? Don't many compression algorithms add header/footer signatures in the compressed file? The nature/pattern of that residual will be the same across multiple different compressed messages (assuming they used the same compression algorithm), and that gives the cracker statistical information about the unencrypted (but compressed) message that he can potentially exploit. – hobs Jan 07 '14 at 00:19
  • If you want the better file size and security, compress, encrypt, then compress again. The final compression won't be able to compress anything but it will randomize the data. – Zintom Jan 23 '21 at 18:09
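
To make the IV handling discussed in these comments concrete, here is a minimal sketch (assuming the third-party `cryptography` package; the function names are illustrative) of AES-CBC with a fresh random IV that is simply prepended to the ciphertext:

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_cbc(key: bytes, data: bytes) -> bytes:
    iv = os.urandom(16)                   # fresh and random, but NOT secret
    padder = padding.PKCS7(128).padder()  # CBC needs whole 16-byte blocks
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()  # store IV with ciphertext

def decrypt_cbc(key: bytes, blob: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]  # recover the prepended IV
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()

key = os.urandom(32)
# Same key, same data, different IV => different-looking ciphertext each time.
assert encrypt_cbc(key, b"same message") != encrypt_cbc(key, b"same message")
assert decrypt_cbc(key, encrypt_cbc(key, b"same message")) == b"same message"
```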

Compression before encryption is surely more space-efficient, but at the same time less secure. That's why I would disagree with the other answers.

Most compression algorithms use "magic" file headers and that could be used for statistical attacks.

For example, the CRIME attack on SSL/TLS exploits exactly this combination of compression before encryption.
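
A toy illustration of the idea (the secret and guesses here are hypothetical; this models only the compression side, since encryption hides content but not length):

```python
import zlib

SECRET = b"sessionid=s3cr3tvalue"  # hypothetical secret in the payload

def ciphertext_length(attacker_controlled: bytes) -> int:
    # Compress-then-encrypt: the attacker cannot read the ciphertext,
    # but its length tracks the compressed length (plus a constant).
    return len(zlib.compress(SECRET + attacker_controlled))

# A guess matching part of the secret is encoded as an LZ77 back-reference,
# so it usually compresses to fewer bytes than a non-matching guess.
print(ciphertext_length(b"sessionid=s3cr3t"))  # typically shorter
print(ciphertext_length(b"sessionid=XXXXXX"))  # typically longer
```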

maxbublis
  • So, is it a trade-off, then? Looks like the two choices are: 1) Encrypt, then compress for greater security but less effective compression. 2) Compress, then encrypt for more effective compression but introduce a vulnerability. – Ajoy Bhatia Jan 08 '18 at 23:07
  • Doesn't this require a known-plaintext attack to be viable against your chosen cryptographic primitive? – Awn May 31 '18 at 21:27

If your encryption algorithm is any good (and AES, with a proper chaining mode, is good), then no compressor will be able to shrink the encrypted text. Or, if you prefer it the other way round: if you succeed in compressing some encrypted text, then it is high time to question the quality of the encryption algorithm…

That is because the output of an encryption system should be indistinguishable from purely random data, even by a determined attacker. A compressor is not a malicious attacker, but it works by trying to find non-random patterns which it can represent with fewer bits. The compressor should not be able to find any such pattern in encrypted text.

So you should compress data first, then encrypt the result, not the other way round. This is what is done in, e.g., the OpenPGP format.
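
A minimal sketch of that order of operations, assuming the third-party `cryptography` package (the function names are illustrative, not taken from OpenPGP):

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def compress_then_encrypt(key: bytes, data: bytes) -> bytes:
    compressed = zlib.compress(data)  # 1. shrink while patterns still exist
    nonce = os.urandom(12)            # 2. fresh nonce for every message
    ciphertext = AESGCM(key).encrypt(nonce, compressed, None)
    return nonce + ciphertext         # nonce is not secret; store it alongside

def decrypt_then_decompress(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return zlib.decompress(AESGCM(key).decrypt(nonce, ciphertext, None))

key = AESGCM.generate_key(bit_length=256)
blob = compress_then_encrypt(key, b"some compressible payload " * 50)
assert decrypt_then_decompress(key, blob) == b"some compressible payload " * 50
```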

Thomas Pornin

Compress first. If you encrypt first, your data turns into (essentially) a stream of random bits. Random bits are incompressible because compression looks for patterns in the data, and a random stream, by definition, has no patterns.

Cameron Skinner

Of course it matters. It's generally better to compress first and then to encrypt.

ZLib uses LZ77 compression followed by Huffman coding. The Huffman tree will be more balanced, and the coding closer to optimal, when it is built over plain text, for instance, so the compression ratio will be better.

Encryption can follow compression. Even though the compressed result appears "encrypted", it can easily be detected as compressed, because compressed files usually start with a magic signature (ZIP archives, for example, begin with "PK").

ZLib doesn't provide encryption natively. That's why I've implemented ZeusProtection. The source code is also available on GitHub.
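
For illustration, a small sketch of such signature detection (the signature table here is a simplified assumption; real formats have more variants):

```python
import zlib

# A few well-known magic signatures; zlib streams at the default
# compression level begin with bytes 0x78 0x9C.
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"\x78\x9c": "zlib (default level)",
}

def looks_compressed(data: bytes):
    """Return the detected format name, or None if no signature matches."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return None

print(looks_compressed(zlib.compress(b"some text")))  # zlib (default level)
```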

mihaipopescu

From a practical perspective, I think you should compress first, simply because many files are already compressed. For example, video encoding usually involves heavy compression. If you encrypt this video file and then compress it, it has now been compressed twice. Not only will the second compression achieve a dismal compression ratio, but it will also take a great deal of resources on large files or streams. As Thomas Pornin and Ferruccio stated, compressing encrypted files may have little effect anyway because of their randomness.

I think the best, and simplest, policy may be to compress files only as needed beforehand (using a whitelist or blacklist), then encrypt them regardless; a minimal sketch of that decision follows.
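
This sketch assumes a hypothetical extension blacklist; tune the list to your own data:

```python
import os.path
import zlib

# Hypothetical blacklist of formats that are already compressed.
ALREADY_COMPRESSED = {".mp4", ".mkv", ".jpg", ".png", ".zip", ".gz", ".7z"}

def prepare_for_encryption(path: str, data: bytes) -> bytes:
    """Compress only payloads likely to benefit; the caller then
    encrypts the returned bytes regardless."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALREADY_COMPRESSED:
        return data                 # already dense; skip recompression
    return zlib.compress(data)      # compressible payload: shrink it first
```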

Victor Stoddard