1

Could somebody help me to understand what is the most significant byte of a 160 bit (SHA-1) hash?

I have a C# code which calls the cryptography library to calculate a hash code from a data stream. In the result I get a 20 byte C# array. Then I calculate another hash code from another data stream and then I need to place the hash codes in ascending order.

Now, I'm trying to understand how to compare them right. Apparently I need to subtract one from another and then check if the result is negative, positive or zero. Technically, I have 2 20 byte arrays, which if we look at from the memory perspective having the least significant byte at the beginning (lower memory address) and the most significant byte at the end (higher memory address). On the other hand looking at them from the human reading perspective the most significant byte is at the beginning and the least significant is at the end and if I'm not mistaken this order is used for comparing GUIDs. Of course, it will give us different order if we use one or another approach. Which way is considered to be the right or conventional one for comparing hash codes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.

alex.49.98
  • 609
  • 5
  • 13

2 Answers2

7

You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.

If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.

Given the comments, you're trying to work with Kademlia, which based on this document treats the hashes as big-endian numbers:

Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.

That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 1
    *You should think of a hash as just bytes, not a number.* - in the context of a DHT that's not necessarily correct since DHTs employ distance metrics, allowing some arithmetic operations to be performed on the IDs (inequalities, differences). So in a sense they are numbers. As for "conventional", they are network protocols, so network byte order would be a reasonable guess, but still just a guess. So consulting the protocol specs certainly is necessary. – the8472 May 18 '15 at 18:24
  • @the8472: I would say that's imposing a numeric interpretation *on* the bytes for the sake of some arbitrary ordering - but they're not meaningful numbers in any other way. It's not like increasing one byte in the source data is guaranteed to increase the interpreted number. So long as the bytes are interpreted the same way by all parties, you'll get consistent ordering, distances etc. – Jon Skeet May 18 '15 at 18:31
  • Fair enough. Do you know which byte order is used by Kademlia DHT, for example? As I understand the distance between nodes is calculated by XOR operation but then we need to know which node is closer and which byte order is used in this case? – alex.49.98 May 18 '15 at 18:32
  • @alex.49.98: No, but I'd expect it to be fairly easy to work out based on either documentation or experimentation. – Jon Skeet May 18 '15 at 18:33
  • @alex.49.98: After 3 minutes of searching I found [this doc](http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html) that states: "Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero." – Jon Skeet May 18 '15 at 18:37
  • 1
    @JonSkeet, DHT keys actually don't have to be *hashes*. That's just whats commonly used to get randomly distributed keys. A protocol can easily do things like "take Key X, increment it by 1 and store auxiliary data under that derived key" or other numerical operations. [In kademlia] the routing table is ordered by natural numeric ordering of the IDs while lookups are ordered by XOR metric. Those may be finer details, but I wouldn't call it meaningless. – the8472 May 18 '15 at 19:05
  • @the8472: That sounds quite odd to me, particularly given the "hash" part of DHT (and the Wikipedia article). It sounds to me like again adding an interpretation of the hash as a number, and then working like that. I'll edit my answer a bit, but I still think SHA-1 values themselves should be considered as just opaque byte strings - you can add whatever interpretation you want on top, but there isn't a *natural* ordering as such. (Or to put it another way, big-endian and little-endian are just as good as each other.) – Jon Skeet May 18 '15 at 19:10
  • 1
    @JonSkeet, to clarify the semantics: A DHT is a hash table implemented on top of a routing protocol. The key-hashes map to keyspace IDs of the routing protocol which are treated more like numbers. This becomes relevant when other RPC services than simple key-value storage are provided over the overlay network. People just tend to lump it all under "DHT" and don't distinguish the layers. – the8472 May 18 '15 at 19:23
  • @the8472: But where do those numbers *come* from? As far as I can see, they're hashes to start with, just as a way of providing arbitrary-but-reproducible bits. That's what I'm saying: yes, they're *interpreted* as numbers, but the hashes themselves aren't numbers. I hope my updated answer makes that reasonably clear... – Jon Skeet May 18 '15 at 19:25
  • 1
    @JonSkeet, in a DHT designed for cryptographic use node IDs *could* directly use ECC public keys (256bits instead of 160 for reasonable security) to identify themselves, which are numbers in nature. My point is that the hashing is not part the routing protocol itself, just something commonly but not exclusively used by things that sit on top of the routing protocol. But yes, your answer is clear enough, those are mostly technicalities. – the8472 May 18 '15 at 19:36
1

You can use SequenceEqual to compare Byte arrays, check the following links for elaborate details:

How to compare two arrays of bytes

Comparing two byte arrays in .NET

Community
  • 1
  • 1
Mrinal Kamboj
  • 11,300
  • 5
  • 40
  • 74
  • Checking for just equality is easy. In my case I need to know if one hash code is greater than another one or less than another one or equal to it. Here is where the byte order of the hash gets into the game. – alex.49.98 May 18 '15 at 18:28