
I'm in the middle of implementing my own DHT for an internal cluster. Since it will be used in a file-sharing program like BitTorrent, "Mainline DHT" was the first thing I looked at. After that I found "entangled" (Python, a DHT using Twisted Matrix), "congress" (Python, a DHT using pyev + libev), and of course the original "kademlia".

They have different approaches on organizing k-buckets:

1) congress and kademlia use 160 fixed buckets, where bucket i holds ids whose XOR distance from us is in the range 2**i <= distance < 2**(i+1), for 0 <= i < 160.

2) Mainline DHT and entangled use dynamic buckets. On start there is just 1 bucket covering the whole id space. Once it is filled with 8 alive nodes, the bucket is split in two -- but ONLY if our own id falls inside that bucket. If it does not, the bucket is never split. So eventually we end up with up to 160 buckets closest to us, plus a few others.
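To make scheme (1) concrete, here is a minimal Python sketch (illustrative only, not taken from any of the cited codebases) of how the fixed-bucket index can be computed from the XOR distance:

```python
def bucket_index(own_id: int, other_id: int) -> int:
    """Fixed-bucket scheme: return i such that 2**i <= distance < 2**(i+1)."""
    distance = own_id ^ other_id
    assert distance != 0, "a node does not store itself"
    return distance.bit_length() - 1

# ids differing only in the lowest bit land in bucket 0:
assert bucket_index(0b1010, 0b1011) == 0
# distance 0b1010 ^ 0b0010 = 8, and 2**3 <= 8 < 2**4, so bucket 3:
assert bucket_index(0b1010, 0b0010) == 3
```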

Both variants are good enough. But I have found a HUGE difference in the logic that decides whether a given id belongs to a given bucket. And this is my question.

congress and kademlia treat bucket boundaries as "minimum distance from us" and "maximum distance from us". So our own ID will ALWAYS be in bucket 0, and the at most 2 other ids in bucket 1 (because it covers distances 2**1 <= x < 2**2) will ALWAYS be the closest to us. So my brain doesn't break, because everything is OK.

But if you look into Mainline DHT or entangled, you will see that bucket boundaries are treated as absolute node id boundaries, not XOR distances! So in a theoretically full table, ids 0, 1, 2, 3, 4, 5, 6, 7 would all be in one bucket.
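A toy sketch of the two membership tests (illustrative names, 4-bit id space for readability) shows the difference:

```python
own_id = 0b1010  # toy 4-bit id, chosen for illustration

def in_bucket_by_distance(node_id: int, lo: int, hi: int) -> bool:
    """congress/kademlia style: lo and hi are XOR distances from our own id."""
    return lo <= (own_id ^ node_id) < hi

def in_bucket_by_value(node_id: int, lo: int, hi: int) -> bool:
    """Mainline DHT/entangled style: lo and hi are absolute id values."""
    return lo <= node_id < hi

# id 0b0011 is "close" by absolute value but "far" by XOR distance:
assert in_bucket_by_value(0b0011, 0, 8)      # 0 <= 3 < 8
assert in_bucket_by_distance(0b0011, 8, 16)  # 0b1010 ^ 0b0011 = 9
```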

So: why do some implementations treat bucket boundaries as "min/max distance from us", while others treat them as "min/max 160-bit integer value"?

Vadim Fint
  • If *every* node had ids 0-7 in the first bucket the implementation could not possibly work. Which id goes in which bucket depends completely on the node's id, otherwise high ids would be impossible to find. Maybe you just missed the distance calculation? How and why the distance calculation works is described [in the paper](http://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf) - entangled does it exactly right. – Jochen Ritzel Oct 09 '11 at 02:25
  • As far as I can see in the entangled and kademlia sources, they do it differently. For a 160-bit id space: kademlia splits the space into 160 buckets, where bucket 0 has ids with 2**0 <= distance_from_us < 2**1, bucket 1 has ids with 2**1 <= distance_from_us < 2**2, and so on. Because 160 bits is a LOT, ids are usually not very close to each other by absolute integer value. So if our own id is in bucket 0, the next several buckets will almost always be empty, while the closest peers will be in e.g. bucket #10. The last bucket can contain half of all ids in the bit space. – Vadim Fint Oct 09 '11 at 17:02
  • But entangled does this: 1) on start we have 1 bucket covering ids 0 <= id < 2**160. Obviously our own id is also there. 2) After we fill k (=8) nodes into that bucket, we split it into two new buckets, 0 <= id < 2**159 and 2**159 <= id < 2**160. We keep splitting only the bucket containing our own id, never the others. Don't you find it strange that the original kademlia organizes buckets by comparing distances, while entangled compares integer ids against bucket boundaries? – Vadim Fint Oct 09 '11 at 17:08

1 Answer


The kademlia paper actually calls out the optimization of dynamically splitting buckets as the routing table grows. There is no difference in logic between the two approaches; it's just an optimization to save some space. When implementing a fixed, full-sized routing table, you have to find the k nodes to send requests to. If the bucket your target falls in is empty, or has fewer than k nodes in it, you have to pick from neighboring buckets. Given that, having the buckets closest to you not be split in the first place makes that search simpler and faster.

As for your point (1), I think you may have misunderstood kademlia. The routing table bucket boundaries are always relative to your own node ID, and the span of the ID space a bucket covers doubles for each bucket further away from you. Without this property (if, say, each bucket covered an equal range of the ID space) you would not be able to do searches properly, and they would certainly not be O(log n).
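The equivalence can be sketched in a few lines of Python (illustrative only, 8-bit ids, assuming ideal splitting): repeatedly splitting the value-range bucket that contains our own id produces, on the far side of each split, a bucket covering exactly one power-of-two band of XOR distances -- the same layout as the fixed distance-based buckets.

```python
BITS = 8
own = 0b10110010  # arbitrary example id

def far_buckets(own_id: int, bits: int):
    """Yield (lo, hi) value ranges of the buckets that do NOT contain own_id."""
    lo, hi = 0, 1 << bits
    for _ in range(bits):
        mid = (lo + hi) // 2
        if own_id < mid:
            yield (mid, hi)  # far half becomes a finished bucket
            hi = mid
        else:
            yield (lo, mid)
            lo = mid

for lo, hi in far_buckets(own, BITS):
    dists = sorted(own ^ x for x in range(lo, hi))
    # every id in this value-range bucket sits in the same
    # 2**i <= distance < 2**(i+1) band relative to our own id
    assert dists[0].bit_length() == dists[-1].bit_length()
```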

The mainline DHT implements kademlia.

Arvid
  • It's a lot of effort for some minor optimization. Since the maximum number of peers is bounded by the bits in the key, and k, is there anything wrong with just searching `(160*8)` node slots (per BT's kademlia parameters)? – Matt Joiner Mar 09 '12 at 13:59
  • @Matt Joiner (or others): I'm stuck with this too. Is this optimization purely an optimization in terms of a bit of a node's memory and CPU time or does it actually improve the system as a whole (by reducing latency or bandwidth)? – orlp Dec 11 '12 at 11:51
  • @nightcracker: I'm not sure. I suspect the whole bucket part is some academic wankery, there's no reason you can't arrive at your own algorithm to pick the nearest nodes. – Matt Joiner Dec 11 '12 at 13:34
  • 1
    if you want to go about and implement some other kind of routing table, keep in mind that there are certain properties it still need to have in order for the network as a whole to have O(log n) lookups. I doubt that you could come up with something that works and is much simpler though. XOR and counting 0-bits is essentially all you need to do. – Arvid Dec 14 '12 at 06:15
  • 1
    @Arvid: Surely you mean count *leading* 0-bits after an XOR? – Matt Joiner Dec 18 '12 at 11:28