
I'm hoping for some high-level advice on how to approach a design I'm about to undertake.

The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).

Further background

I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.

The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?

I've considered

  1. Storing the payloads in an array such that an index implies (and is implied by) a "tree address", and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand. It would probably also incur even greater memory overhead, because not all branches of the tree go to the same depth, so the lower levels would be full of empty node elements.

  2. Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here is that each node would need the array's base address in order to find its neighbors. So either every node stores it (a million times over) or it gets passed around with every function call. I don't like this.

  3. Assuming that the most-significant 32 bits of all these pointers are zero (throwing an exception if they aren't) and storing only the least-significant 32 bits, so that the full pointer can be reconstructed on demand. The system is likely to use more than 4GB, but the process never will. I'm just assuming that pointers are offset from a process base address, and I have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).

  4. Storing the difference between the 64-bit `this` and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of `int32_t` (and throwing if it isn't). Then any node can find its neighbors by adding that offset to `this`. (A minimal sketch of this idea follows the list.)
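
For concreteness, here is a minimal sketch of what idea 4 might look like; the `Node` layout and the `encode`/`decode` helper names are purely illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct Node {
    int64_t payload;                      // the 8-byte payload
    int32_t parent, left, middle, right;  // self-relative byte offsets

    // Encode a neighbor pointer as a self-relative byte offset,
    // throwing if the difference doesn't fit in int32_t.
    int32_t encode(const Node* neighbor) const {
        std::ptrdiff_t d = reinterpret_cast<const char*>(neighbor)
                         - reinterpret_cast<const char*>(this);
        if (d < INT32_MIN || d > INT32_MAX)
            throw std::range_error("neighbor more than 2 GB away");
        return static_cast<int32_t>(d);
    }

    // Reconstruct the full 64-bit pointer on demand.
    Node* decode(int32_t offset) {
        return reinterpret_cast<Node*>(reinterpret_cast<char*>(this) + offset);
    }
};
```

This shrinks each node from 40 bytes (payload plus four 64-bit pointers) to 24 bytes.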

Any advice? Regarding that last idea (which I currently feel is my best candidate): can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2GB of each other? Or is that not necessarily so?

Museful
  • Is speed an issue? If not, you could write your own memory allocation function that works with 32-bit pointers. For that, you could reserve an area in memory where your data gets stored, store the address of that space, and work within that space with 32-bit (or even 16-bit) addresses. If you can guarantee that THIS process won't ever exceed 2GB, you could reserve that much of an amount... Also, each process has its own address space, starting at 0x00. So even if the system will take a larger amount of RAM, your process will always start at address 0x00. – Nidhoegger Nov 05 '15 at 09:13
  • @Nidhoegger That idea seems to suffer from problems similar to number 2 in my question. For the nodes to dereference those pointers they need the base address, or a 64-bit pointer to something that does the dereferencing. Right? – Museful Nov 05 '15 at 09:17
  • You would need to write your own deref function: give it the 32-bit address and your function will add the base address to it. – Nidhoegger Nov 05 '15 at 09:19
  • *can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other?* Not generally, no. On Linux, you need `mmap(..., MAP_32BIT, ...)` if you want that guarantee. In practice it happened to work for a medium size allocation (~18kiB) by a growing `std::vector` on Linux+gcc, but I wouldn't want to assume anything about `malloc` in general, especially for large allocations (like 2MiB or more = multiple hugepages). See [this super-hacky microbenchmark with 32-bit pointers zero-extending to 64-bit](https://stackoverflow.com/a/47678774/224132) – Peter Cordes Dec 07 '17 at 23:14
  • See https://stackoverflow.com/questions/8156608/smaller-pointers-possible-without-a-lower-spec-system/49911194#8156608 - it's very close. – Vadim Berman Apr 19 '18 at 03:19

4 Answers


Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. `int32_t neighborOffset = neighborIndex - thisIndex`. Then you can get the neighbor from `*(this + neighborOffset)`. This gets rid of the disadvantages/assumptions of both 2 and 4.
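
A minimal sketch of this, assuming all nodes live in a single array (the names are illustrative):

```cpp
#include <cstdint>
#include <vector>

struct Node {
    int64_t payload;
    int32_t parent, left, middle, right;   // neighbor offsets in array elements

    // Well-defined pointer arithmetic because both nodes belong to the
    // same array. Range check (as in idea 4) omitted for brevity.
    Node*   neighbor(int32_t offset)          { return this + offset; }
    int32_t offsetTo(const Node* other) const { return static_cast<int32_t>(other - this); }
};

// Usage sketch: reserve the pool up front so links aren't invalidated
// by reallocation.
// std::vector<Node> pool;
// pool.reserve(maxNodes);
// pool[i].left = pool[i].offsetTo(&pool[j]);
// Node* leftChild = pool[i].neighbor(pool[i].left);
```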

Museful
  • As a bonus, this can compile efficiently on x86-64: the compiler can load the offset with `movsxd` to sign-extend it to 64 bits, and then `*(this+neighborOffset)` can be done with an indexed addressing mode (with a scale factor, because your offsets are in `int32_t` 4-byte units, not in bytes). But you do need to also calculate the new current address in a register. – Peter Cordes Dec 07 '17 at 23:06
  • Still, pointer-chasing in this data structure is as easy as `lea rsi, [rsi + rdx*4]` / `movsxd rdx, dword [rsi]`. (It's ok to have LEA on the critical path on Intel SnB-family CPUs, where `[rsi]` has 1c less load-use latency than an indexed addressing mode. Otherwise you (or the compiler) could use an extra register to `movsxd` with an indexed addressing mode in parallel with LEA to get the new `current`). – Peter Cordes Dec 07 '17 at 23:07
  • Storing simple indices (not offsets) can compile *more* efficiently, assuming the recursion turns into a loop. Just a single `mov` (zero-extension of unsigned index) or `movsx` (sign-extension of signed index) to load, and the base pointer stays constant. Makes the source uglier, though, if you have to pass around the array pointer. But you do need them all in the same array anyway to make sure they're within 2GB, unless you allocate with `mmap(MAP_32BIT)`. – Peter Cordes Dec 07 '17 at 23:17

If on Linux, you might consider using (and compiling for) the x32 ABI (e.g. with GCC's `-mx32`), which gives you 32-bit pointers while keeping the full 64-bit instruction set. IMHO, this is the preferred solution for your issue.

Alternatively, don't use pointers, but indices into a huge array (or a `std::vector` in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indices of nodes instead of pointers to nodes. So like your §2, but since the array is global or static data, you won't need to pass it everywhere.

(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
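
A minimal sketch of this index-based variant (names illustrative; `-1` marks a missing link):

```cpp
#include <cstdint>
#include <vector>

struct Node {
    int64_t payload;
    int32_t parent, left, middle, right;  // indices into g_nodes, -1 = none
};

// Global pool: nodes refer to each other by index, and nothing needs
// to carry a base pointer around.
static std::vector<Node> g_nodes;

inline int32_t newNode(int64_t payload) {
    g_nodes.push_back({payload, -1, -1, -1, -1});
    return static_cast<int32_t>(g_nodes.size()) - 1;
}

inline Node& node(int32_t index) { return g_nodes[index]; }
```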

Basile Starynkevitch

You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4 GB boundary.

Then within a node with address `addr`, you can determine the address of another node at `index` as `(addr & -(1ULL << 32)) + index` (parentheses needed, since `&` binds more loosely than `+`).
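
In code, recovering the base from a node's own address might look like this (a sketch; it assumes the array really was placed on a 4 GB boundary, and the `Node` layout is illustrative):

```cpp
#include <cstdint>

struct Node { int64_t payload; uint32_t parent, left, middle, right; };

// Recover the array's base from any node address inside it, relying on
// the array starting at a 4 GB (1ULL << 32) boundary.
inline Node* arrayBase(Node* n) {
    auto addr = reinterpret_cast<std::uintptr_t>(n);
    return reinterpret_cast<Node*>(addr & ~((std::uintptr_t{1} << 32) - 1));
}

// A stored 32-bit index then resolves without any explicit base pointer.
inline Node* resolve(Node* anyNodeInArray, uint32_t index) {
    return arrayBase(anyNodeInArray) + index;
}
```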

This is kind of the "absolute" variant of the accepted solution, which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the `(node_address, index)` pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.

The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).

You can also store the base tree structure (e.g., the pointer to the root and whatever bookkeeping you have outside of the nodes themselves) right at the 4 GB address, which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.

Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
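
A sketch of that implicit-parent variant, assuming each parent sits at the start of an N-byte-aligned block and its children are allocated within that same block (all names, and the block size, are illustrative):

```cpp
#include <cstdint>

struct Node { int64_t payload; uint32_t left, middle, right; };  // no stored parent link

constexpr std::uintptr_t kBlock = 4096;  // illustrative N; must be a power of two

// Any child can find its parent with a mask instead of a stored pointer.
inline Node* implicitParent(Node* child) {
    auto addr = reinterpret_cast<std::uintptr_t>(child);
    return reinterpret_cast<Node*>(addr & ~(kBlock - 1));
}
```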

You can accomplish this kind of thing by writing your own allocator that uses `mmap` to allocate suitably aligned blocks (usually just reserve a huge amount of virtual address space and then allocate blocks of it as needed) - either via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on a 2^N-byte boundary you can "compress" your indexes further since you know the low N bits will always be zero, so with a 32-bit index you could actually address 2^(32+5) = 2^37 bytes of nodes if you knew they were 32-byte aligned.
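
A sketch of the over-reserve-and-trim approach on Linux (error handling mostly omitted; `alignment` must be a power of two, and the function name is illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Reserve `size` bytes of virtual address space starting at an
// `alignment`-aligned address, by over-reserving and unmapping the
// excess. Pages only consume physical memory once touched.
void* reserveAligned(std::size_t size, std::size_t alignment) {
    std::size_t total = size + alignment;
    void* raw = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED) return nullptr;

    auto addr    = reinterpret_cast<std::uintptr_t>(raw);
    auto aligned = (addr + alignment - 1) & ~(alignment - 1);

    // Trim the unused head and tail of the reservation.
    if (aligned > addr)
        munmap(raw, aligned - addr);
    if (std::size_t tail = (addr + total) - (aligned + size))
        munmap(reinterpret_cast<void*>(aligned + size), tail);

    return reinterpret_cast<void*>(aligned);
}
```

With `alignment = 1ULL << 32`, the returned block starts on the 4 GB boundary that the masking trick above relies on.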

These kinds of tricks are really only feasible in 64-bit programs, with their huge amount of virtual address space, so in a way 64-bit giveth and also taketh away.

BeeOnRope

Your assumption that a 64-bit system necessarily has 64-bit pointers is not correct. The C++ standard makes no such guarantee.

In fact, different pointer types can be different sizes: `sizeof(double*)` might not be the same as `sizeof(int*)`.

Short answer: don't make any assumptions about the sizes of any C++ pointer.

It sounds to me like you want to build your own memory management framework.

Bathsheba
  • This is more a comment than an answer. – Basile Starynkevitch Nov 05 '15 at 09:18
  • A pretty big comment..., and I do make a recommendation! – Bathsheba Nov 05 '15 at 09:18
  • What's the practical use of _not making any assumptions about the sizes of any C++ pointer_? Just ignore it entirely? You can check `sizeof`, and outside of exotic platforms it will be as you expect. If it's not, then you take another approach, but this is a practical problem with practical solutions. – BeeOnRope Dec 08 '17 at 04:02