Should data in B+Tree be ordered and how much data should I be loading at a time?

Question

Within my B+Tree I have ordered keys at the leaf levels, with data pointers to a separate data file. This question refers to the order of data within this data file.

As I see it, When ordered:

It's Easier to load all the data in a block when one is read, as data is stored in the same order as its keys, so you only have to check if adjacent data pointers are in the same block.
Less reads when accessing a lot of adjacent data or performing ranges, as data is more likely to be contained in the same block.
Increased fragmentation and writes when deleting/splitting/inserting

when not ordered:

Decreased writes when inserting, as data is just appended to the end of the last block associated with the node.
Increased reads when performing ranges, as data is less likely to be split between multiple blocks.
It's a lot slower to find the entry that other data belongs to within the same block, as you have to loop through all the entries in the nodes checking their data pointers.

Alternatively, should I just load entire nodes into memory when I need to access data from an entry within that node?

Looking for a second opionion on the best way I should be storing data (ordered/unordered) and how many data pointers should I be loading when performing a simple "get" for one value?

Thanks!

trincot · Accepted Answer · 2021-11-13T17:13:03.970

Alternatively, should I just load entire nodes into memory when I need to access data from an entry within that node?

Absolutely. That is the idea behind the B-tree family of data structures. It is not intended for reading/writing partial nodes (blocks) from/to slow storage. When a node is needed, it is read in its entirety into memory (if not already loaded), manipulated, and written back entirely to persist it.

As the manipulation of the data in the node itself happens in memory, the choice of whether to keep the node's content sorted or not is of less importance. The read/write operations to the slow(er) storage will be much more determining for the overall performance.

Making considerations of the time complexity of either choice are irrelevant too, since there is a preset maximum to the number of keys in a node. So all single-node operations can be considered to have a constant time complexity.

You could time the different implementations and see if that leads to a clear choice. This choice will depend on the degree of the B-tree.

"When a node is needed, it is read in its entirety into memory (if not already loaded), manipulated, and written back entirely to persist it." Even when nodes can extend multiple blocks with variable length values? — roat, Nov 13 '21 at 21:01
The values are payload. You could have the keys combined with *addresses* of where the actual values are stored (in different blocks). The traversal through the tree, and the search within a node, concerns *keys*. In case key/value pairs are shifted within a node, this concerns the key and the corresponding value *address*. See [this image on Wikipedia](https://en.wikipedia.org/wiki/File:Bplustree.png) — trincot, Nov 13 '21 at 21:21

Should data in B+Tree be ordered and how much data should I be loading at a time?

1 Answers1