
In the Kademlia paper, the last paragraph of section 2.4 states that in order to properly handle highly unbalanced trees...

Kademlia nodes keep all valid contacts in a subtree of size at least k nodes, even if this requires splitting buckets in which the node's own ID does not reside.

However, the previous section of the paper seems to state that if a k-bucket already has k elements, any further additions into that k-bucket require evicting the stalest node (pinging it first to see if it's alive) or otherwise caching the addition until a slot becomes available in that k-bucket.

The paper seems to be contradicting itself with these two points.

Under what conditions should a k-bucket be split, and why? It seems impractical to keep "all valid contacts" in the routing table, as the routing table would then grow very large very quickly. The example describes a tree with many nodes whose IDs start with 001 and one node whose ID starts with 000. Would the node starting with 000 have to continually split its k-bucket for 001 in order to hold every valid node starting with 001? In a 160-bit address space, wouldn't that end up potentially storing 2^157 nodes in 000's routing table?

The wording in the quoted block is also very confusing...

"in a subtree" -- in which subtree of the routing table?

"of size at least k nodes" -- what metric are we using to determine the size of the subtree? Does "nodes" here refer to Kademlia nodes, k-buckets, or something else?


1 Answer


However, the previous section of the paper seems to state that if a k-bucket already has k elements, any further additions into that k-bucket require evicting the stalest node (pinging it first to see if it's alive) or otherwise caching the addition until a slot becomes available in that k-bucket.

That is how a bucket is maintained whenever there is a node contact to insert but the bucket is not eligible for splitting.

Under what conditions should a k-bucket be split and why?

As a first approximation: Split a bucket whenever a new node cannot be inserted and the bucket's ID-space covers your node ID.

This is necessary to maintain full awareness of your own neighborhood while keeping only vague awareness of remote portions of the keyspace, i.e. for locality.
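As a rough illustration, the strict rule can be sketched like this. The `Bucket`/`RoutingTable` names and the plain integer IDs are invented for the sketch, not from the paper, and a tiny k is used so the splitting is visible:

```python
K = 2  # bucket capacity; the paper uses k = 20, a small value keeps the sketch readable

class Bucket:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi          # this bucket covers IDs in [lo, hi)
        self.contacts = []

    def covers(self, node_id):
        return self.lo <= node_id < self.hi

class RoutingTable:
    def __init__(self, own_id, id_bits=8):
        self.own_id = own_id
        # start with a single bucket spanning the entire ID space
        self.buckets = [Bucket(0, 1 << id_bits)]

    def bucket_for(self, node_id):
        return next(b for b in self.buckets if b.covers(node_id))

    def insert(self, contact):
        bucket = self.bucket_for(contact)
        if contact in bucket.contacts:
            return
        if len(bucket.contacts) < K:
            bucket.contacts.append(contact)
        elif bucket.covers(self.own_id):   # strict rule: only the home bucket may split
            self._split(bucket)
            self.insert(contact)           # retry in the narrower bucket
        # else: ping the stalest contact / use a replacement cache instead of splitting

    def _split(self, bucket):
        mid = (bucket.lo + bucket.hi) // 2
        left, right = Bucket(bucket.lo, mid), Bucket(mid, bucket.hi)
        for c in bucket.contacts:
            (left if left.covers(c) else right).contacts.append(c)
        i = self.buckets.index(bucket)
        self.buckets[i:i + 1] = [left, right]
```

Note how a bucket far from `own_id` simply stops accepting contacts once full, while the home bucket keeps subdividing -- that asymmetry is exactly the "full near, vague far" view.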

To cover the unbalanced-tree case - which may happen either if node IDs aren't (pseudo-)random or, at least in leaf buckets, due to statistical flukes when they are assigned at random - the approach has to be relaxed as follows:

When

  • attempting to insert a new contact C in the routing table
  • the bucket into which C belongs is full
  • C is closer to your node ID than the Kth-closest node in your routing table, where K is the bucket size

then split the bucket.
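A minimal sketch of that relaxed check, using Kademlia's XOR distance on plain integer IDs (the function and parameter names here are invented for illustration; a real table wouldn't re-sort all contacts on every insert):

```python
def should_split_relaxed(own_id, new_contact, bucket_contacts,
                         bucket_covers_own_id, all_table_contacts, k):
    """Decide whether a full bucket should be split to admit new_contact.

    Distances use Kademlia's XOR metric; IDs are plain integers here.
    """
    if len(bucket_contacts) < k:
        return False                  # bucket not full; just insert normally
    if bucket_covers_own_id:
        return True                   # strict rule: the home bucket always splits
    if len(all_table_contacts) < k:
        return True                   # too few contacts to judge; keep the new one
    # the k-th closest existing contact to our own ID
    kth_closest = sorted(all_table_contacts, key=lambda c: c ^ own_id)[k - 1]
    return (new_contact ^ own_id) < (kth_closest ^ own_id)
```

In other words, a non-home bucket is allowed to split only when the new contact would rank among the k closest nodes we know of -- which is precisely what keeps the routing table from exploding to 2^157 entries in the question's example.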

In practice this has to be modified a little further: relaxed splitting should only be used for responses, while unsolicited requests should only trigger strict splitting. Otherwise you could end up with a strangely distorted routing table when relaxed splitting happens early during startup, while the table isn't populated yet.

And of course this also needs to be taken into account when merging buckets, so that buckets created by relaxed splits don't get merged back into a single non-home bucket as long as there are no closer nodes in the routing table.


One of the improvements proposed in the S/Kademlia paper is using a sibling list instead to keep track of a node's nearest neighbors, which may be easier to implement than non-local splitting.
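For illustration, a sibling list can be as simple as a sorted list of the s closest known nodes, maintained independently of the buckets. This sketch is one possible shape under that assumption, not S/Kademlia's exact construction:

```python
import bisect

class SiblingList:
    """Keeps the s nodes closest (by XOR distance) to our own ID."""

    def __init__(self, own_id, s):
        self.own_id, self.s = own_id, s
        self.entries = []                 # kept sorted by XOR distance to own_id

    def offer(self, contact):
        """Try to add a contact; returns True if it was kept as a sibling."""
        d = contact ^ self.own_id
        keys = [c ^ self.own_id for c in self.entries]
        if d in keys:
            return False                  # already present
        i = bisect.bisect(keys, d)
        self.entries.insert(i, contact)
        if len(self.entries) > self.s:
            self.entries.pop()            # evict the farthest sibling
        return contact in self.entries
```

Because the list is maintained globally rather than per bucket, the neighborhood invariant holds without ever splitting a non-home bucket.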

  • Thank you for this answer. I've been struggling to understand this for the past week. Is what you've outlined here what was stated in the original paper? – offbynull Aug 24 '15 at 20:48
  • Difficult question. I would say that it is an implementation approach that is equivalent to the intent of the respective paragraphs in the paper. Someone else actually brought up this part on IRC a few months ago and I had to think quite some time until I actually got it. I have reasoned incorrectly about Kademlia before, so I am not willing to assert correctness for my interpretation. That said, real world implementations are often far less precise (only implementing the 1st approximation) and still seem to work as long as IDs are uniformly distributed. It just makes them noisier. – the8472 Aug 24 '15 at 21:10
  • I am struggling with the original paper too. It seems to contradict itself. In one section it talks about 160 distance based k buckets, in another a key based binary tree of k buckets whose size is dynamic. Is there anything out there on the web that can translate that thing into plain English? – Sentinel Nov 26 '17 at 23:57
  • @Sentinel the comment section is not appropriate for such explanations. you should make a new question or look through existing ones – the8472 Nov 27 '17 at 07:54
  • Hi @the8472, when you refer to the "Kth-closest node in your routing table" are you referring to the last node found in your other answer on how to implement finding nodes? Thank you for answering all these questions they're extremely helpful. I'm trying to implement Kademlia in C# myself. What IRC channel were you referring to? Is there a place actively discuss Kademlia? – masterwok Sep 03 '19 at 20:03
  • The algorithm how you enumerate a set of *N* nodes closest to some target ID *T* doesn't matter to this answer. But yes, I outlined such an algorithm in another answer. In this special case it is *T* = your own node ID. – the8472 Sep 03 '19 at 22:48
  • @masterwok the irc channel would be `##bittorrent` on freenode. – the8472 Apr 12 '20 at 12:52
  • Would this solve the original issue in the paper? There are more than k nodes with prefix 001 and only one new node with prefix 000. Let's call the node with prefix 000 node u. we want u to be in all the nodes with prefix 001's routing table. In this strategy, suppose node u discovers nodes in increasing order of distance, then the k+1th node will not be inserted into the table. – sukunrt Jan 07 '23 at 06:07