Review: Protocol for encryption/decryption of big files with authentication

Question

I've been trying to figure out the best way to accomplish the task of encrypting big (several GB) files into the file system for later access.

I've been experimenting with several modes of AES (particularly CBC and GCM) and there are some pros and cons I've found on each approach.

After researching and asking around, I came to the conclusion that, at least for now, using AES+GCM is not feasible for me, mostly because of the issues it has in Java and the fact that I can't use BouncyCastle.

So I am writing this to talk about the protocol I'm going to be implementing to complete the task. Please provide feedback as you see fit.

Encryption

Using AES/CBC/PKCS5Padding with 256-bit keys.
The file will be encrypted using a custom CipherOutputStream. This output stream will take care of writing a custom header at the beginning of the file, which will consist of at least the following:
- First few bytes to easily tell that the file is encrypted
- IV
- Algorithm, mode and padding used
- Size of the key
- The length of the header itself
While the file is being encrypted, it will also be digested to calculate its authentication tag.
When the encryption ends, the tag will be appended at the end of the file. The tag is of a known size, so this makes it easy to recover later.
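
The encryption steps above could be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the protocol description: the magic bytes `ENC1`, the order and encoding of the header fields, HMAC-SHA256 as the authentication tag, a separate `macKey`, and the class and method names are all hypothetical. It also assumes the key is a software key whose `getEncoded()` returns the raw bytes.

```java
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

final class EncryptThenMacWriter {
    static final byte[] MAGIC = {'E', 'N', 'C', '1'};          // hypothetical file marker
    static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    static final int IV_LENGTH = 16;                            // AES block size in bytes

    /** Writes header, ciphertext and a trailing HMAC tag computed over header + ciphertext. */
    static void encrypt(InputStream plain, OutputStream file, SecretKey encKey, SecretKey macKey)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv);                       // fresh random IV per file

        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, encKey, new IvParameterSpec(iv));
        Mac mac = Mac.getInstance("HmacSHA256");                // assumed MAC algorithm
        mac.init(macKey);

        // The header is authenticated too, so the IV and parameters cannot be altered unnoticed.
        byte[] header = buildHeader(iv, encKey.getEncoded().length * 8);
        mac.update(header);
        file.write(header);

        byte[] buffer = new byte[8192];
        int read;
        while ((read = plain.read(buffer)) != -1) {
            byte[] encrypted = cipher.update(buffer, 0, read);  // null if no full block is ready yet
            if (encrypted != null) {
                mac.update(encrypted);
                file.write(encrypted);
            }
        }
        byte[] last = cipher.doFinal();                         // final padded block(s)
        mac.update(last);
        file.write(last);

        file.write(mac.doFinal());                              // fixed-size tag (32 bytes) at the end
        file.flush();
    }

    /** Hypothetical layout: magic, header length, key size in bits, transformation name, IV. */
    static byte[] buildHeader(byte[] iv, int keyBits) throws IOException {
        byte[] name = TRANSFORMATION.getBytes(StandardCharsets.US_ASCII);
        int headerLength = MAGIC.length + 4 + 4 + 2 + name.length + iv.length;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(headerLength);
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(MAGIC);
        out.writeInt(headerLength);
        out.writeInt(keyBits);
        out.writeUTF(TRANSFORMATION);                           // 2-byte length prefix + name
        out.write(iv);
        out.flush();
        return bytes.toByteArray();
    }
}
```

The sketch drives the `Cipher` directly instead of wrapping it in a `CipherOutputStream`, because closing a `CipherOutputStream` also closes the underlying stream, which would get in the way of appending the tag afterwards. A custom stream could hide the same logic behind `write()` and `close()`.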

Decryption

A custom CipherInputStream will be used. This stream knows how to read the header.
It will then read the authentication tag and digest the whole file (without decrypting it) to validate that it has not been tampered with. (I haven't actually measured how this will perform; however, it's the only way I can think of to safely start decryption without the risk of finding out too late that the file should not have been decrypted in the first place.)
If the validation of the tag is ok, then the header will provide all the information needed to initialize the cipher and make the input stream decrypt the file. Otherwise it will fail.
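
A minimal sketch of this verify-first, decrypt-second flow is shown below. It assumes a hypothetical file layout (magic `ENC1`, header length, key size, transformation name, IV, then ciphertext, then a 32-byte HMAC-SHA256 tag over everything before it); none of these details, nor the class and method names, come from the description above.

```java
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;

final class VerifyThenDecrypt {
    static final byte[] MAGIC = {'E', 'N', 'C', '1'};   // hypothetical file marker
    static final int TAG_LENGTH = 32;                   // HMAC-SHA256 output size
    static final int IV_LENGTH = 16;

    /** First pass: recompute the tag over everything before the stored tag, without decrypting. */
    static boolean verify(Path file, SecretKey macKey) throws IOException, GeneralSecurityException {
        long remaining = Files.size(file) - TAG_LENGTH;
        if (remaining < 0) return false;
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        byte[] storedTag = new byte[TAG_LENGTH];
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            byte[] buffer = new byte[8192];
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n == -1) throw new EOFException();
                mac.update(buffer, 0, n);
                remaining -= n;
            }
            in.readFully(storedTag);
        }
        return MessageDigest.isEqual(mac.doFinal(), storedTag);  // constant-time comparison
    }

    /** Second pass: only runs if the tag matched; reads the header, then decrypts the body. */
    static void decrypt(Path file, SecretKey encKey, SecretKey macKey, OutputStream plainOut)
            throws IOException, GeneralSecurityException {
        if (!verify(file, macKey)) throw new SecurityException("authentication tag mismatch");
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            byte[] magic = new byte[MAGIC.length];
            in.readFully(magic);
            if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an encrypted file");
            int headerLength = in.readInt();
            in.readInt();                                // key size in bits, unused in this sketch
            String transformation = in.readUTF();
            byte[] iv = new byte[IV_LENGTH];
            in.readFully(iv);

            Cipher cipher = Cipher.getInstance(transformation);
            cipher.init(Cipher.DECRYPT_MODE, encKey, new IvParameterSpec(iv));
            long remaining = Files.size(file) - TAG_LENGTH - headerLength;
            byte[] buffer = new byte[8192];
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n == -1) throw new EOFException();
                byte[] decrypted = cipher.update(buffer, 0, n);  // null if no full block is ready yet
                if (decrypted != null) plainOut.write(decrypted);
                remaining -= n;
            }
            plainOut.write(cipher.doFinal());            // strips the padding
        }
    }
}
```

One thing this sketch does not address is a file that changes between the two passes; if that is a concern, the file would need to be protected from modification while it is being read.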

Is this something that seems ok to you in order to handle encryption/decryption of big files?

Ok, sorry, forget the previous comment. Didn't read the links until now. Question 1: Is the SHA hash within the encryption layer (i.e. SHA of the plaintext, then encrypt the hash too) or outside? Should be inside. Question 2: Why CBC? (Not "why not GCM", but why CBC of all other possible things?) — deviantfan, Nov 24 '14 at 20:42
**Question 1:** according to what I read, the sequence should be encrypt then mac, so that's what I'm intending to do. **Question 2:** Well, CBC was the first I experimented with, and I have the whole encryption done with it already, so I thought of using it. What would be the issue with it and what would be a better alternative in your opinion? — alejo, Nov 24 '14 at 20:49

score 2 · Accepted Answer · answered Nov 24 '14 at 21:24

Some points:

A) Hashing of the encrypted data, with the hash not encrypted itself.

One of the possible things a malicious human M could do without any hash: overwrite the encrypted file with something else. M doesn't know the key, the plaintext before, or the plaintext after this action, but he can change the plaintext to something different (usually, it becomes garbage data). Destruction is also a valid purpose for some people.

The "good" user with the key can still decrypt it without problems, but it won't be the original plaintext. That is no problem if the result is garbage data, provided (and only provided) you know for sure what should be inside, i.e. how to recognize whether it is unchanged. But do you know that in every case? And there's a small chance that the "garbage" data actually makes sense, but is not the real data anyway.

So, to recognize if the file was changed, you add a SHA hash of the encrypted data.
And if the evil user M overwrites the encrypted file part, what will he do with the hash? Right, he can recalculate it so that it matches the new encrypted data. Once again, you can't recognize changes.
If the plaintext is hashed and then everything is encrypted, it's pretty much impossible for M to get it right. Remember, M doesn't know the key or anything. M can change the plaintext inside to "something", but can't change the hash to the correct value for this something.
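
The problem with an unkeyed hash over the encrypted data can be illustrated with a small sketch. For contrast it also shows a keyed MAC (HMAC-SHA256 here, chosen only as an example), which is the alternative raised in the comments on this answer. The class and method names are hypothetical.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;

final class HashVsMacDemo {
    /** Unkeyed digest: anyone who holds the data can compute it, including M. */
    static byte[] sha256(byte[] data) throws GeneralSecurityException {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    /** Keyed tag: computing the right value requires the secret MAC key. */
    static byte[] hmacSha256(byte[] key, byte[] data) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data);
    }

    /** What a reader would check if only a plain SHA-256 digest were stored with the file. */
    static boolean verifyWithHash(byte[] data, byte[] storedDigest) throws GeneralSecurityException {
        return MessageDigest.isEqual(sha256(data), storedDigest);
    }

    /** What a reader would check if an HMAC tag were stored with the file. */
    static boolean verifyWithMac(byte[] key, byte[] data, byte[] storedTag) throws GeneralSecurityException {
        return MessageDigest.isEqual(hmacSha256(key, data), storedTag);
    }
}
```

If M overwrites the ciphertext with `forged` and stores `sha256(forged)` next to it, `verifyWithHash(forged, sha256(forged))` accepts the file. With a keyed tag, M would have to guess the key: `verifyWithMac(realKey, forged, hmacSha256(guessedKey, forged))` rejects it unless the guess happens to be right.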

B) CBC

CBC is fine if you decrypt the whole file or nothing every time.
If you want to access parts of it without decrypting the unused parts, look at XTS.

C) Processing twice

> It will then read the authentication tag and digest the whole file (without decrypting it) to validate that it has not been tampered with. (I haven't actually measured how this will perform; however, it's the only way I can think of to safely start decryption without the risk of finding out too late that the file should not have been decrypted in the first place.)

Depending on how the files are used, this is indeed necessary. Especially if you want to start using the decrypted data while the final step is still running, before it has finished.

I don't know the details of the Java CipherOutputStream, but besides that and the points mentioned above, it looks fine to me.

A) So if I understand you correctly, what you mean is that the MAC should be calculated over the plain text, and then plain text + MAC encrypted. Is that what you are saying? According to [this](http://crypto.stackexchange.com/questions/202/should-we-mac-then-encrypt-or-encrypt-then-mac) that's definitely a possible approach, however, not necessarily the one that provides the best set of features when compared with Encrypt then MAC. Am I getting something wrong? — alejo, Nov 24 '14 at 21:58
C) The files will more likely be streamed, to be either read, replayed or whatever depending on their original format. My intention with that potential double processing is to not start decrypting anything that may need to be stored in a temp file, for example (because they are big files), only to find at the very end that the file has been tampered with, in which case I would already have given too much information to a potential adversary — alejo, Nov 24 '14 at 22:00
B) CBC it is then, as I need all or nothing when accessing the files — alejo, Nov 24 '14 at 22:01
@alejo A: Please note that the link talks about MACs in general. Many (or, depending on the view, "all") MAC algorithms depend on a key themselves, so the attacker can't regenerate the tag without knowing it. The SHA family does not need keys; everyone can use it and the result will be the same every time (for some given data). — deviantfan, Nov 24 '14 at 22:14
I'm using a MAC that relies on a key to be generated, so I think it's safe to do it over the encrypted data. Also, if I did it over the plain text, in order to authenticate it I would first need to decrypt, and in such case (because of the file size) I may be a target of side channel attacks I think — alejo, Nov 27 '14 at 23:42
