
Final goal: I want to detect equality of mesh geometry data (and other attributes).

My chosen means: comparing their "top SHA-1 hash".

But to make the top hash, I need to combine hashes from sub-routines, each hashing its own data.

TLDR

Question: how do you combine cryptographic hashes, such as SHA-1 hashes?
(I'd like a boost::hash_combine<160>(h1, h2))


NLEWTRM (Not Long Enough Want To Read More)

In typical procedural programming you have a "function call tree", equivalent to what happens when you let the compiler generate your copy constructors: when you use one from a top-level class, you get a cascade of calls which executes a "deep copy".

So, equivalently, you can manually write a series of hashable types which implement size_t hash_value() (an unordered-container requirement), and calling hash_value() on any type (via ADL) will result in a "deep hash", because you took care to use boost::hash_combine when calculating each object's "local top hash".
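The cascade described above can be sketched like this. Note that hash_combine is reimplemented locally (using the same formula and magic constant as boost's) so the sketch compiles without Boost, and Vertex/Mesh are made-up stand-ins for real mesh data:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Local reimplementation of boost::hash_combine's formula,
// so this sketch does not depend on Boost.
inline void hash_combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct Vertex
{
    float x, y, z;
};

// Found via ADL, mirroring the unordered-container protocol.
inline std::size_t hash_value(const Vertex& v)
{
    std::size_t seed = 0;
    hash_combine(seed, std::hash<float>{}(v.x));
    hash_combine(seed, std::hash<float>{}(v.y));
    hash_combine(seed, std::hash<float>{}(v.z));
    return seed;
}

struct Mesh
{
    std::vector<Vertex> vertices;
};

// The "deep hash": the mesh's local top hash cascades into its members.
inline std::size_t hash_value(const Mesh& m)
{
    std::size_t seed = 0;
    for (const Vertex& v : m.vertices)
        hash_combine(seed, hash_value(v));
    return seed;
}
```

Two meshes with the same data hash equal; the whole question is how to do the same with 160-bit digests instead of size_t.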

I have seen how boost::hash_combine is implemented, and that is all well for a size_t.

This combination is necessary for progressive calculation of the hash from subroutines that respect encapsulation.

I reckon this problem is somehow similar to "hash trees" or "hash lists", and has also been addressed in MD5 in terms of "blocks" (from my knowledge), for hashing a stream whose data you can't store all at once.

Note that in my usage I don't need THE HASH that would be made by hashing all the data at once. Because my implementation (the call tree) is stable, 2 different mesh objects will produce the same hash if they have the same data, and that is the only thing that counts. (I need comparability, not officialism or canonicalism, nor crypto strength. BUT I do need reasonable uniqueness guarantees, more than what a size_t offers.)

Obvious way 1: what I'm currently considering as a solution is to use the sub-hashes themselves as a message, concatenate, and re-hash: newsha = sha1(hash1 + hash2), with + a message (buffer) concatenation operation.
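A hypothetical sketch of this concatenate-and-rehash pattern. Here fake_sha1 is a made-up stand-in (an FNV-1a-based stretch with no crypto strength whatsoever) purely so the sketch is self-contained; in real code you would call an actual SHA-1 routine at that point:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using Digest = std::array<std::uint8_t, 20>;  // 160 bits, SHA-1 sized

// Stand-in for a real SHA-1 call: 64-bit FNV-1a stretched over 20 bytes.
// No crypto strength; only here so the sketch compiles and runs alone.
Digest fake_sha1(const std::uint8_t* data, std::size_t len)
{
    Digest out{};
    std::uint64_t h = 1469598103934665603ull;     // FNV offset basis
    for (std::size_t i = 0; i < len; ++i)
        h = (h ^ data[i]) * 1099511628211ull;     // FNV prime
    for (std::size_t i = 0; i < out.size(); ++i) {
        h = (h ^ i) * 1099511628211ull;           // stretch to 20 bytes
        out[i] = static_cast<std::uint8_t>(h >> 32);
    }
    return out;
}

// "Obvious way 1": treat the two sub-hashes as a message,
// concatenate them, and hash the concatenation.
Digest combine(const Digest& h1, const Digest& h2)
{
    std::array<std::uint8_t, 40> buf;
    std::copy(h1.begin(), h1.end(), buf.begin());
    std::copy(h2.begin(), h2.end(), buf.begin() + 20);
    return fake_sha1(buf.data(), buf.size());
}
```

With a real SHA-1 in place of fake_sha1, this is exactly one level of a hash list.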

Second obvious way: apply newsha = hash1 ^ hash2, with ^ the bitwise XOR operator.

v.oddou
  • I don't understand your problem. If you've "seen how hash_combine is implemented" - why not use the same general logic for your 160 bit SHA-1 values, or even split the SHA-1 hashes into 5 and use boost's hash_combine on each slice? Have you actually tried anything, and if so what and what concrete problems did it have? – Tony Delroy Apr 21 '15 at 09:13
  • @TonyD Good comment; no, I haven't tried anything, and the reason is the sensitivity of the method's strength, which I cannot prove (because I'm a n00b in crypto). The thing is, I **can** make a `hash_combine` for SHA-1 (cf. the small EDIT at the end). The question is: "will it be good enough, or will it create crazy flaws in the final hash?" – v.oddou Apr 21 '15 at 09:17
  • For example, if you take your idea of doing a succession of 32-bit combinations, and check out http://stackoverflow.com/q/4948780 you'll see that the "invariant" of having no correlation between the bits is broken by the repetition of the pattern. (It's the same problem as sample distribution on a pixel: repetition of a seemingly good (locally) random pattern introduces aliasing globally, if some severe criteria aren't respected.) – v.oddou Apr 21 '15 at 09:30
  • I just found a very very related question here http://cstheory.stackexchange.com/questions/1684/associative-hash-mixing – v.oddou Apr 21 '15 at 09:36
  • 2
    Equal hashes do not guarantee equality of data. Any hash function will have clashes, where different inputs produce identical outputs. Different hashes can reliably show inequality in the data. Equal hashes may, or may not, indicate equality of the underlying data. – rossum Apr 21 '15 at 12:33
  • @rossum That's however *not true* for cryptographic hashes, where this is called a collision. Although there are endless inputs that map to the same hash value, generating such a *collision* means that the cryptographic hash is broken. – Maarten Bodewes Apr 21 '15 at 15:44
  • 1
    @Maarten All hashes have collisions. The point about crypto hashes is that collisions are *difficult* to find, not impossible. – rossum Apr 21 '15 at 16:16
  • @rossum yes of course. But in my application the guarantee provided is good enough. We are talking about comparing human created data, so considering the probability of running into a collision, we must be talking dozens of years before a bug occurs. And when it occurs, it won't cause human harm or property damage. this is just a game asset production pipeline we're talking about. – v.oddou Apr 22 '15 at 00:16
  • @v.oddou your first "obvious way" is great but slow; the second a classic mistake: given two SHA1 hashes of the same data somewhere in your "tree" - XOR cancels them out. Regarding hash_combining 5 slices independently, I don't follow your reasoning about why that's significantly flawed. Sure it's not perfect because a bit changed in one slice never impacts other slices (unless you also hash_combine or do some weaker interaction across slices), but your inputs are SHA1 hashes - you just need to avoid them cancelling each other out and ignoring bits - `<<` *and* `>>` with `^` handles that. – Tony Delroy Apr 22 '15 at 04:07
  • @TonyD sorry, I didn't want to say it was "significantly" flawed. Just slightly introducing coherency. Here is what I propose: Make an answer that says I can make a combiner with a specially cooked 160 bit golden ratio magical constant. (`2^160 / phi` ? ), I'll accept it because it is faster than the hash list. – v.oddou Apr 22 '15 at 05:00

2 Answers


If you've "seen how hash_combine is implemented" - why not use the same general logic for your 160 bit SHA-1 values? That is, instead of...

seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

...add a 160-bit constant that has about half its bits set, in a nicely random arrangement. (You'll need to implement 160-bit arithmetic or use a suitable library, e.g. GMP.) There's no particular need to shift by larger amounts given your SHA-1 hashes are high-quality inputs: no reason to think the less-significant bits are consistently more or less random than the others. If you want to follow boost's example for picking a constant, reapply the logic at 160 bits instead of 32:

phi = (1 + sqrt(5)) / 2
the_constant = 2^159 / phi
             = 0x4F1B BCDC BFA5 3E0A F9CE 6030 2E76 E41A 0841 13B5

WARNING: I have not analysed the implications of this approach for cryptographic applications.
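For illustration, a hypothetical rough sketch of such a combiner: it applies the formula independently to five 32-bit limbs of the 160-bit state, deliberately dropping inter-limb carries (simpler, but weaker than true 160-bit arithmetic via GMP or similar):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Digest = std::array<std::uint32_t, 5>;  // 160 bits as 5 words

// The answer's 160-bit constant, split into 32-bit limbs.
constexpr Digest kGolden = {
    0x4F1BBCDCu, 0xBFA53E0Au, 0xF9CE6030u, 0x2E76E41Au, 0x084113B5u
};

// boost::hash_combine's formula applied limb-by-limb; carries between
// limbs are deliberately dropped, so this is a "rough job" variant.
void hash_combine_160(Digest& seed, const Digest& h)
{
    for (std::size_t i = 0; i < 5; ++i)
        seed[i] ^= h[i] + kGolden[i] + (seed[i] << 6) + (seed[i] >> 2);
}
```

The result is order-sensitive (combining A then B differs from B then A), which is what you want from a combiner.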

Tony Delroy
  • instead of GMP do you think it'll be OK to apply it 2 times on 64 bits and 1 time on 32 bits, on the corresponding subparts of `seed` and `v`? `seed[0] ^= ((u64*)hash1)[0] + 0x4f..0a + (((u64*)hash2)[0] << 6) + (((u64*)hash2)[0] >> 2); seed[1] ^= ((u64*)hash1)[1] + 0xf9..1a + ...` – v.oddou Apr 22 '15 at 06:27
  • @v.oddou: could do - up to you whether you want to do a rough job or handle carries from the additions, the 6 bits lost when shifting left, the 2 when shifting right... the more you do the stronger/slower it'll be. – Tony Delroy Apr 22 '15 at 06:39
  • 1
    Are you sure this is cryptographically secure? Do you have any pointers to papers claiming that this could be cryptographically secure? Or is this a "good enough" approach that may not be cryptographically secure? Fine if it is, but a warning is in order in that case. – Maarten Bodewes Apr 22 '15 at 22:31
  • 1
    @MaartenBodewes: I make no claims regarding the security of this - the OP's stated *"this is just a game asset production pipeline we're talking about"*, which I take to mean the worst that can happen is a player gains or loses some items. Still, I'll add a warning for the benefit of whomever else looks at this question for whatever purposes.... – Tony Delroy Apr 23 '15 at 00:53
  • @TonyD No problem, that's the only thing I was after, voted up. I was just commenting here to safeguard the crypto security of whomever else :) – Maarten Bodewes Apr 23 '15 at 14:13
  • @MaartenBodewes : I concur and +1 your concern. Future googlers need to understand this point. so the bold warning definitely has its place. For the final note, I shall add that I chose your "hash list" solution. Because the bit mix I ended up doing, was not too far from the actual sha1 function, so in terms of ALU cycles it made no sense anymore to lose the good crypto properties of sha1 to the "benefit" of the boost style combining. today memory access is the bottleneck, not bit mangling ALU ops. – v.oddou Apr 24 '15 at 01:26

obvious way: what I think about now as a solution, is to use the sub hash themselves as a message, concatenate and re-hash. newsha = sha1(hash1 + hash2) with + a message (buffer) concatenation operation.

Yes, this is basically what a hash-list is, which you already mentioned.

Don't use the XOR method: each time you use it you make it easier for somebody to create a collision (and generally the amount of data to hash will be much less than the amount of data hashed in the lower layers of a hash list, so performance should not be much of an issue). And I completely forgot the issue of identical hashes canceling each other out; kudos go to Tony D for mentioning it.
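The cancellation problem is easy to demonstrate: with bytewise XOR, combining a digest with itself yields all zeros, and the result is also insensitive to the order of its inputs. A minimal sketch:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Digest = std::array<std::uint8_t, 20>;  // a SHA-1 sized digest

// The flawed "second obvious way": bytewise XOR of two digests.
Digest xor_combine(const Digest& a, const Digest& b)
{
    Digest out;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = a[i] ^ b[i];
    return out;
}
```

Any two equal sub-hashes anywhere in the tree erase each other from the final digest, and swapping two children leaves the result unchanged: both are fatal for a content-identity hash.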

Note that the idea of a hash list does require an implicit ordering of elements. If you don't have a specific order then you could sort the hashes in memory before performing the concatenation. Sorting the hashes would e.g. be ideal for calculating a cryptographically secure hash over the elements of an std::unordered_set.
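A sketch of that canonicalization. Here demo_hash is a made-up stand-in for the project's real SHA-1 routine (any stable byte hash demonstrates the ordering idea):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Digest = std::array<std::uint8_t, 20>;

// Placeholder for a real SHA-1 over a byte buffer; not crypto-strength.
Digest demo_hash(const std::uint8_t* data, std::size_t len)
{
    Digest out{};
    std::uint8_t h = 0;
    for (std::size_t i = 0; i < len; ++i) {
        h = static_cast<std::uint8_t>(h * 31 + data[i]);
        out[i % 20] ^= h;
    }
    return out;
}

// Canonical hash of an unordered collection: sort the element digests
// lexicographically, concatenate, re-hash. {A,B} and {B,A} now agree.
Digest hash_unordered(std::vector<Digest> hashes)
{
    std::sort(hashes.begin(), hashes.end());   // canonical order
    std::vector<std::uint8_t> buf;
    buf.reserve(hashes.size() * 20);
    for (const Digest& d : hashes)
        buf.insert(buf.end(), d.begin(), d.end());
    return demo_hash(buf.data(), buf.size());
}
```

Sorting makes the combined hash independent of iteration order, at the cost Tony D notes in the comments: swapping the data behind two element hashes no longer changes the result.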

Maarten Bodewes
  • Absolutely, you're spot on mentioning performance and order. I thought about order in particular; I just need to take care if one future day the hasher becomes multithreaded. About performance, it's not a real-time app (an exporter), but speed is still appreciated. Imagine re-exporting a whole Blu-ray's worth of resources when the engine version changes; this takes about one week today. Better performance would be tangible business value. I thought the XOR technique wasn't really disproven by the community yet, and that it is used in TLS 1? (I read that somewhere.) – v.oddou Apr 22 '15 at 00:23
  • 1
    *"you could sort the hashes in memory"* - that might be good enough, but it does mean that if the data leading to two SHA1 hashes is swapped, the combined hash won't change. – Tony Delroy Apr 22 '15 at 04:17
  • @TonyD Eh, yes, that was the point. You should only sort if the elements don't have an implicit ordering. If they do have an implicit ordering, then you can use the same order for the concatenation of hashes. If both `A|B` and `B|A` are acceptable then you need to create a *canonical* way of creating the hash list. Sorting the hashes is an easy way of performing the canonicalization. You could for instance have an `unordered_set` that you want to include. In that case you need an explicit ordering of the hashes to do a compare with a set with the same elements in a different order. – Maarten Bodewes Apr 22 '15 at 22:20