11

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.

I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.

I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.

I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.

This is not what I want.

I want to know whether there is an algorithm where its possible to get the resulting hash from data where the very result of the hash itself is included.

It doesn't need to be cryptographic or fullfill a lot of criterias. It can also be based on some heuristics that after a realistic amount of time deliver the desired result.

I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynom cyclic back-reference devision stuff that makes this possible?

And if not, whats (if there is) the proof against it?

The reason why i need tis is because i want (ultimately) to store a hash along with MP4 files. Its complicated, but other solutions are not easy to implement as the file walks through a badly desigend production pipeline...

The Surrican
  • 29,118
  • 24
  • 122
  • 168
  • Answering the above question is at least as hard as this one: [Is there an MD5 Fixed Point where md5(x) == x?](http://stackoverflow.com/questions/235785/is-there-an-md5-fixed-point-where-md5x-x) – Greg Hewgill Dec 13 '10 at 19:30
  • 1
    @Greg: reread. The OP is aware that this is impossible with MD5 and SHA. – Jason S Dec 13 '10 at 19:31
  • its not a duplicate of that because this deals with a special cryptographic hash and the question is different because it needs to be the same as the function of itself. my question is about more algorithms, also non crypticals and as well data containing that hash. not beeing the hash itself. – The Surrican Dec 13 '10 at 19:33
  • @Jason S: It's not "impossible". It's hard. – Greg Hewgill Dec 13 '10 at 19:34

6 Answers6

7

It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.

It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.

(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2-8*byte_offset in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)


Note: as per @KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)

Jason S
  • 184,598
  • 164
  • 608
  • 970
  • wow! crc is not so hard, did it on paper a while back during exams :D let me think that through... crc would also be ideal (good error detection) and it came to my mind when thinking about this +1 – The Surrican Dec 13 '10 at 19:34
  • The catch is that crc32 is 79,228,162,514,264,337,593,543,950,336 worse than md5 as a hash. A 32bit checksum is very easy to have collisions with. – Tyler Eaves Dec 13 '10 at 19:40
  • it is worse, i am aware of that. but its still a pretty good integrity checker, if that is the only requirement – The Surrican Dec 13 '10 at 19:42
  • 2
    This is basically Tyler's second option; hash the data, then tack the hash onto the front of the file. The disadvantage of this is that the hash is part of the file, meaning you have to trust a part of the file in order to ensure you can trust the rest of it. It also allows the ability to hack one file to tamper with files, instead of having to hack both the file and the MD5 signature repository; as such, it can only prove well-formedness of the file and not that it was what the legitimate author of the file uploaded. – KeithS Dec 13 '10 at 19:44
  • 1
    BTW: Don't do this if you are using SurroundSCM for source control. By default it uses CRC-32 as an indicator of whether a file has changed from the copy in the repository, with no option to use MD5 or SHA1 instead. (!!!) – Jason S Dec 13 '10 at 19:53
2

Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.

For example, let

hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521. 

Let

Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)

Then

hash(Z) == hash(X) == last 32-bits of Z

The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less then 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32 bit integer that will not affect the hash. You could actually use any number less than 2^16, 65521 just happens to be the largest such prime number.

Chris Hopman
  • 2,082
  • 12
  • 11
1

I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Altough in theory you could create such file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a chose: to use exactly your approach to obtain such file, or just to get rid of the check.

It means that now you have two problems instead of one, and both should be implemented with the same complexity. It's up to you to decide if it worth it.

EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.

ruslik
  • 14,714
  • 1
  • 39
  • 40
  • cool. thats what i need. thing is there won't be an attacker, the file is handled trough a production pipeline and only the file is transfered, i would have to update it on every step to pass a long a hash to theck for integrity. i just want to make sure that the file doesnt get damaged. – The Surrican Dec 13 '10 at 22:09
0

No, not possible. You either you a separate file for hashs ala md5sum, or the embedded hash is only for the "data" portion of the file.

Tyler Eaves
  • 12,879
  • 1
  • 32
  • 39
  • 1
    I know it sounds unlogical that there could be otherwise. but do you have any proof/reason for that? - edit: Jason S'S answer suggests otherwise – The Surrican Dec 13 '10 at 19:32
  • The hash requires complete knowledge of the file, and is contained within the file. So, you create the file, create the hash of the file which doesn't contain the hash, then add the hash to the file which changes its contents and thus hashing the file again would produce a different hash. The hash would be used as a checksum to prove the file had not been tampered with, so tampering with the file after hashing it (by adding the hash) would always fail a re-hash on the receiving end given a proper hashing algorithm. – KeithS Dec 13 '10 at 19:38
0

It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).

Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

patros
  • 7,719
  • 3
  • 28
  • 37
  • yeah i did that very exact example in my head and thought whether there was a possibility to extend the logic to proof for some other way but didnt get far :D – The Surrican Dec 13 '10 at 19:39
  • 1
    There are an infinite number of hash functions that will work. How many of them are useful is another matter... – patros Dec 13 '10 at 20:46
0

the way the nix package manager does this is by when calculating the hash you pretend the contents of the hash in the file are some fixed value like 20 x's and not the hash of the file then you write the hash over those 20 x's and when you check the hash you read that and ignore again it pretending the hash was just the fixed value of 20 x's when hashing

they do this because the paths at which a package is installed depend on the hash of the whole package so as the hash is of fixed length they set it as some fixed value and then replace it with the real hash and when verifying they ignore the value they placed and pretend it's that fixed value

but if you don't use such a method is it impossible

Dan D.
  • 73,243
  • 15
  • 104
  • 123
  • yeah, thats basically sending the hash seperately but in the same file... i am aware of that possibility but unfortunately cant do it. i'm not really convinced its impossible, look at what Jason S states. sounds logical to me. – The Surrican Dec 13 '10 at 19:40
  • 1
    so really this comes down to finding a fix point `H(h||m) = h` which really not something one should have to do at least during normal operation but this is stack overflow it's by definition abnormal – Dan D. Dec 13 '10 at 19:43
  • exactly. +1 for that, especially the last part ;) – The Surrican Dec 13 '10 at 19:47