50

I have a project in which I have to achieve fast search, insert, and delete operations on data ranging from megabytes to terabytes. I have been studying data structures lately and analyzing them. To be specific, I want to introduce three cases and ask questions about them:

  1. The data is much more than the memory can handle at one go (sample ranges of 10-15 terabytes). In this case, I would store the data structure on disk.

  2. The data is small relative to the system's memory, so it can be stored and operated on in memory itself for speed.

  3. The data is larger than the free memory, but assume it is smaller than the largest possible contiguous chunk in the paging file. Thus I store the data structure in a file on disk and memory-map the file.

The conclusions I have drawn are:

For case 1, I should use a B-tree for faster access, since it saves on the lag produced by disk rotation.

For case 2, I should use a red-black tree for faster access, since the data is in memory and the number of elements that need to be scanned in the worst case would be smaller than with a B-tree.

For case 3, I am doubtful: the page file is on disk and uses native OS I/O to operate on files, so would a B-tree be the better option, or a red-black tree?

I want to know where the above three conclusions are right, where they go wrong, and how I can improve performance in the three separate cases.

I am using C++, with a red-black tree and a B-tree, both of which I have designed from scratch. I am using the Boost library for file mapping.
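A minimal sketch of the case 3 setup is below. It assumes Boost.Interprocess (the question only says "Boost"; Boost.Iostreams' mapped_file is the other common choice), and the file name btree.dat is purely illustrative:

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

int main() {
    // Map an existing file (the path is illustrative) into this process's
    // virtual address space; the OS pages it in and out on demand.
    bip::file_mapping  file("btree.dat", bip::read_write);
    bip::mapped_region region(file, bip::read_write);

    void*       addr = region.get_address();  // start of the mapping
    std::size_t size = region.get_size();     // number of bytes mapped

    // On-disk tree nodes can now be read and written as ordinary memory,
    // e.g. by interpreting byte offsets from `addr` as node positions.
    (void)addr;
    (void)size;
}
```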

Update 1: I was reading through this post on Stack Overflow and got some really good insights, which make me feel that the kind of comparisons I have made in these cases may be faulty. A link was posted in the most-voted-for answer: http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html

swanar
  • What kind of search are you going to do? A simple search by key? What does the key look like? – svick Jun 19 '11 at 13:22
  • You're more or less correct. Go on with the implementation, Ask here if you get stuck. – nikhil Jun 19 '11 at 13:44
  • @svick Yes, I am doing a simple search by key. In the most general case, the keys would be a set of distinct natural numbers, either discrete or in numerically continuous order, starting from 1 up to a value like (2^8)-1 – swanar Jun 19 '11 at 19:44

2 Answers

28

A red/black tree is more or less equivalent to a 2-3-4 tree, which is just a type of B-tree. The worst-case performance is identical, provided you do a binary search of the B-tree node values.
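To illustrate "binary search of the B-tree node values", here is a minimal sketch in C++ (the asker's language); the Node layout and the order of 16 are assumptions for the example, not anyone's actual implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical order-16 B-tree node: keys[0..count) are sorted, and
// children[i] subtends the keys between keys[i-1] and keys[i].
struct Node {
    std::uint64_t keys[15];
    Node*         children[16];
    int           count;
    bool          leaf;
};

// Binary search inside each node, then descend. The total number of key
// comparisons stays O(log n), matching a balanced binary search tree.
bool contains(const Node* n, std::uint64_t key) {
    while (n) {
        const std::uint64_t* first = n->keys;
        const std::uint64_t* last  = n->keys + n->count;
        const std::uint64_t* it    = std::lower_bound(first, last, key);
        if (it != last && *it == key) return true;   // found in this node
        if (n->leaf) return false;                   // nowhere left to go
        n = n->children[it - first];                 // child left of *it
    }
    return false;
}
```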

The obvious disadvantage of a B-tree is wasted space, but depending on the language/memory allocator used, you may find that a 2-3-4 tree uses less space than a red-black tree on average. For instance, in 32-bit Java, there's approximately an 8-byte overhead per object. (It also depends a lot on the allocator; IIRC phkmalloc rounds up small allocations to a power-of-2 size.)
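To make the space trade-off concrete in C++ (the asker's language), here is a rough comparison of per-element storage for two hypothetical node layouts; it ignores allocator overhead, which is exactly the part that varies between languages and allocators:

```cpp
#include <cstdint>
#include <cstdio>

// One element per node, as in a typical red-black tree.
struct RBNode {
    std::uint64_t key;
    RBNode*       left;
    RBNode*       right;
    RBNode*       parent;
    bool          red;     // color bit
};

// Up to Order-1 keys and Order children per node (Order = 4 gives a 2-3-4 tree).
template <int Order>
struct BTreeNode {
    std::uint64_t keys[Order - 1];
    BTreeNode*    children[Order];
    int           count;   // keys currently stored
};

int main() {
    std::printf("red-black node: %zu bytes per element\n", sizeof(RBNode));
    std::printf("2-3-4 node:     %zu bytes, ~%zu per element when full\n",
                sizeof(BTreeNode<4>), sizeof(BTreeNode<4>) / 3);
}
```

On a typical 64-bit build this prints roughly 40 bytes per element for the red-black node versus about 21 for a full 2-3-4 node, though nodes are rarely completely full in practice.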

To answer your cases,

  1. Disk latency is roughly evenly split between seek time and waiting for the disk to rotate.
  2. A B-tree should be able to outperform a red-black tree if you're doing it right (in particular, a B-tree should be faster if nodes fit into a cacheline.)
  3. It doesn't need to be contiguous in the page file; it merely needs to be contiguous in the process's virtual address space. For sane OSes, it's also pretty much identical to case 1, unless your data is small enough that it mostly fits into memory and the memcpy overhead is significant.

For simplicity, I'd go with a B-tree and run some benchmarks on various node sizes.
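As a starting point for those benchmarks, one knob worth exposing is the node size in bytes; the sizes and the 8-byte key/pointer figures below are just assumptions to tune against, not recommendations:

```cpp
#include <cstddef>
#include <cstdio>

// How many keys fit in a node of `node_bytes`, assuming fixed-size keys
// and child pointers plus a small per-node header. A node must hold n
// keys and n+1 child pointers: header + n*key_bytes + (n+1)*ptr_bytes.
constexpr std::size_t keys_per_node(std::size_t node_bytes,
                                    std::size_t key_bytes = 8,
                                    std::size_t ptr_bytes = 8,
                                    std::size_t header    = 16) {
    return (node_bytes - header - ptr_bytes) / (key_bytes + ptr_bytes);
}

int main() {
    std::printf("64 B node (cache line, case 2):  %zu keys\n", keys_per_node(64));
    std::printf("4 KiB node (page size, case 3):  %zu keys\n", keys_per_node(4096));
    std::printf("64 KiB node (big reads, case 1): %zu keys\n", keys_per_node(65536));
}
```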

tc.
  • Thank you very much for the inputs. Would you suggest going with a 2-3-4 tree even when the data set is large? Wouldn't it be better if the node sizes were similar to the page size on disk? You do have strong points supporting a 2-3-4 tree as a substitute for a red-black tree, though – swanar Jun 19 '11 at 19:50
  • I did say "run some benchmarks on various node sizes". The advantage of just using a B-tree is that you can run some benchmarks and tune it to your liking. You also might want to think about data locality (i.e. if your keys are strings, you probably want to keep the strings near the nodes). If paging is the slow bit, you definitely want nodes at least as big as the page size, but probably bigger (assuming your disk does readahead). And then the answer is different again for SSDs... – tc. Jun 20 '11 at 00:08
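On the data-locality point in the comment above, here is a hedged sketch of the two layouts being contrasted, assuming short string keys (the structures and the 15-character limit are illustrative):

```cpp
#include <array>
#include <cstring>
#include <string>

// Pointer-chasing layout: each key comparison may touch a separate heap
// allocation holding the string's characters.
struct NodeIndirect {
    std::string   keys[15];
    NodeIndirect* children[16];
    int           count;
};

// Inline layout: short keys live inside the node, so loading the node
// also brings the key bytes into cache. Keys longer than 15 characters
// would need an overflow scheme, omitted here.
struct NodeInline {
    std::array<char, 16> keys[15];   // NUL-terminated, at most 15 chars
    NodeInline*          children[16];
    int                  count;
};

inline int compare_key(const NodeInline& n, int i, const char* key) {
    return std::strcmp(key, n.keys[i].data());
}
```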
1

Robert Sedgewick's experimental data (freely available online) shows that the red capacity is constantly moving up the black tree and is swapped out to the black tree top, the quadratic cost of which is identical to other tree algorithms.

The red-black tree algorithm suffers from a global red waterfall (red fountain).

B-trees and red-black trees have the same fundamental quadratic swapping cost, which is exactly identical to a very simple static list sorting.

In order to visualize the invisible hidden entropy flow, you must interpret everything as the generalized red-green-black tree. Each branch in the 6-B-tree (or larger) is decomposed this way.

All the extra capacity must be interpreted as green. The central node must be black, and the outermost nodes must be colored red.

The central black node is the invariant support (backbone) and it must not be touched except when explicitly needed. The outermost red nodes must be empty, because they are the main swap-out channels.

The red Carnot channel must always be larger than the green channel; if the green channel exceeds 1/2, any swapping becomes impossible.

Because this is an abstract entropy flow, you must ignore individual permutations and extract only the capacity flow; energy transparently flows through random particles.

Because the green capacity condenses down to the root, the red capacity is constantly pushed up to the tree top by the green border, which is the global red waterfall.

The B-tree swap-out channel is upside-down. You must recursively merge-split 2 vertical black rows from top to root to generate the basic Carnot swap-out flow. The strong restriction to the black node motion increases the extra redundant red motion cost. If you try to optimize one subsystem, the entropic cost naturally flows out to the other subsystem.

The quadratic swapping cost of binary tree is identical to the cost of swapping out the abstract height(volume) growth, which is a simple gravity. The shape of the binary tree is exactly identical to the logarithmic Newton gravity visualization.

If you have a very large static list, the swapping cost is 0. Such a list is always as large as the sample space. The compression of the large list always has the Shannon-Huffman entropic cost. This is a very simple hidden Carnot I/O cost.

If you put all the buckets together, you get a very simple static list, which is well known as the Young diagram. This static list is far too small, so it always has an invariant swapping cost. Any tree is a very simple 2-dimensional static list. The cost of compressing the height of a 2-dimensional Young diagram is always invariant.

The main cost of tree algorithms is not the algorithm cost (green cost); their main cost is the extra hidden Carnot swap-out I/O, which requires local full sorting with large-list compression-decompression (red cost).

  • The binary tree has a fundamental asymmetry; because the head is always heavy, it absorbs more red heat. The black tree top is always cool, which is a very simple heat bath (red heat absorber). – cheeseandtomato Oct 15 '21 at 12:02
  • Every single red absorption moves down exactly one black hole, which moves the red capacity down. – cheeseandtomato Oct 16 '21 at 03:12
  • The green capacity must move down to the root, because the 6-B-tree algorithm tries to auto-pack everything in order to compress the cost. This compression packs the green capacity and moves the green heat to the root. The green capacity is a free particle; the swap-in cost is not expensive. – cheeseandtomato Oct 16 '21 at 23:10
  • Large-width B trees suffer from a very simple Foehn phenomenon. Because the swap-out channel shrinks narrower and narrower, it generates massive red compression. – cheeseandtomato Oct 17 '21 at 00:44
  • This is the greatest Fake reply of all time – xy2 Jul 03 '23 at 18:27