A single node of a 2-3-4 tree could be built from eight pointers: up to four child pointers, up to three pointers to the actual records holding the keys (each key either matches the search key or decides which child to recurse into), and a parent pointer.
Common hardware today has 8-byte pointers, giving a 64-byte node. Further, modern CPUs have 64-byte cache lines. If the nodes are aligned to cache lines, each node costs at most one cache-line fetch: once the first of the eight pointers has been touched, the rest are already in L1 cache.
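Concretely, here is roughly the layout I have in mind (a minimal C++ sketch, assuming 8-byte pointers and 64-byte cache lines; `Record` is a hypothetical stand-in for whatever the application actually stores):

```cpp
struct Record;  // hypothetical application record carrying a key

// One 2-3-4 node packed into a single 64-byte cache line, assuming 8-byte
// pointers: 4 child pointers + 3 record pointers + 1 parent pointer = 64 bytes.
struct alignas(64) Node234 {
    Node234* child[4];   // up to four children
    Record*  rec[3];     // up to three key-bearing records
    Node234* parent;     // parent pointer
};

static_assert(sizeof(Node234) == 64,
              "Node234 should occupy exactly one 64-byte cache line");
```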
A red-black tree is far simpler to implement, and small code should be fast code, but each level of descent in the tree risks an L1 cache miss. For 1023 keys, a 2-3-4 tree whose nodes are fully packed (all 4-nodes) is only 5 levels deep, so a search loads at most 5 nodes into cache; even its worst case (all 2-nodes) is 10 levels. A perfectly balanced binary tree would also need 10, and a red-black tree can be deeper due to imbalance: the usual worst-case bound is 2·log2(n+1), i.e. 20 levels here.
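Spelling out that arithmetic (a sketch assuming the standard height formulas; the `levels` helper is just mine for illustration):

```cpp
#include <cstdio>

// Back-of-the-envelope path lengths for n = 1023 keys:
//   2-3-4 tree, all 4-nodes:   smallest h with 4^h - 1 >= n
//   balanced binary tree:      smallest h with 2^h - 1 >= n
//   red-black tree worst case: 2 * log2(n + 1)
int levels(long long n, long long fanout) {
    int h = 0;
    long long keys = 0, width = 1;
    while (keys < n) {                // add one full level at a time
        keys += width * (fanout - 1); // a full node holds fanout-1 keys
        width *= fanout;
        ++h;
    }
    return h;
}

int main() {
    const long long n = 1023;
    std::printf("2-3-4, all 4-nodes: %d levels\n", levels(n, 4));      // 5
    std::printf("balanced binary:    %d levels\n", levels(n, 2));      // 10
    std::printf("red-black worst:    %d levels\n", 2 * levels(n, 2));  // 20
}
```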
Small test harnesses that simply hammer on one data structure will probably keep it all in cache, and so may report the red-black tree as performing about the same as the 2-3-4 tree. But I have a feeling that a complicated real-world application, where the tree has to share cache with everything else, would see noticeably less wall-clock time and lower latency with 2-3-4 trees.
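To show the kind of harness I mean, here is a rough sketch that uses std::set as the red-black baseline (that is what most standard libraries implement under the hood); a 2-3-4 implementation would be timed the same way. The point is that the working set is much larger than cache and the lookups arrive in random order, so every descent chases cold pointers instead of replaying warm ones:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main() {
    constexpr int N = 1 << 22;                 // ~4M keys, well beyond typical L2/L3
    std::vector<int> keys(N);
    for (int i = 0; i < N; ++i) keys[i] = i;
    std::mt19937 rng(42);
    std::shuffle(keys.begin(), keys.end(), rng);

    std::set<int> rbtree(keys.begin(), keys.end());   // red-black tree baseline

    auto t0 = std::chrono::steady_clock::now();
    long long found = 0;
    for (int k : keys)
        found += rbtree.count(k);              // random-order lookups, cold pointers
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld lookups, %lld found, %lld ms\n",
                (long long)N, found, (long long)ms);
}
```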
Is there any consensus or research on this?