22

In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidian metric with limited explanations as to why each of the properties of a valid metric are necessary or interesting, namely:

  • d(x,x) = 0
  • d(x,y) > 0, if x != y
  • forall x,y : d(x,y) = d(y,x) -- symmetry
  • d(x,z) <= d(x,y) + d(y,z) -- triangle inequality

Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?

In addition, the paper mentions that unidirectionality (for a given x, and a distance l, there exist only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?

Erick
  • 305
  • 2
  • 7

3 Answers3

15

I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...

  • d(x,x) = 0
  • d(x,y) > 0, if x != y

These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)

In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.

  • forall x,y : d(x,y) = d(y,x)

The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.

The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.

If we didn't have this property, it might be helpful to think of the search as more like the hands of a clock moving in one direction round a clockface. The node at 1 o'clock (Node1) is close to Node2 at 2 o'clock (30°), but Node2 is far from Node1 (330°). So imagine we're looking for the two closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1 since it's far away. The whole lookup and topology would have to change.

  • d(x,z) <= d(x,y) + d(y,z)

If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.

Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.

The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, then the target point could be encircled by a bunch of nodes all the same distance from it. Then various different searches would be free pick any of those to pass through. However, unidirectionality guarantees that exactly one of this bunch will be the closest, and any search which chooses between this group will always select the same one.

Fraser
  • 74,704
  • 20
  • 238
  • 215
7

I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?

Well it can't, such a metric on its own is not enough for a comparable relationship, all it can do is dump nodes in circles around a point.

Then I read the paper more closely and noticed that it says "the XOR as an integer value" and it dawned on me: the crux is not the "XOR metric", but the length of the common prefix of the ID (of which XOR is a derivation mechanism.)

Take two nodes with the same Hamming distance from "self" and the length of their prefix common to "self": the one with shortest common prefix is the furthest node.

The paper uses "XOR distance metric" but it really should read "ID prefix length total ordering"

Eddy
  • 1,662
  • 2
  • 21
  • 36
6

I think this may explain it a wee bit, let me know http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/

Basically each hop if it were only one bit at a time in a fully populated network (extreme) then would have twice the knowledge of the previous hop. As you converge the knowledge is greater until you get to the closest nodes whose knowledge is ultimate in the network.

dirvine
  • 57,399
  • 2
  • 18
  • 19