5

I have a huge tree that can take up to several gigabytes. The node structure is shown below. You'll notice that I made the last member an array of size 1. The reason for this is that I can over-allocate a Node with a flexible size, similar to what C natively supports with flexible array members. I could use std::unique_ptr<T[]> or std::vector<T> instead, but the problem is that then there is double dynamic allocation, double indirection, and extra cache misses for each tree node. In my last test, this made my program take about 50% more time, which is simply not acceptable for my application.

template<typename T>
class Node
{
public:
  Node<T> *parent;
  Node<T> *child;

  /* ... */

  T &operator[](int);
private:
  int size;
  T array[1];
};
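
For reference, I over-allocate each node roughly like this (a simplified sketch, not my exact code; it assumes a static Node<T>::create factory is declared inside the class so it can reach the private members, and error handling is omitted).

#include <cstddef> // std::size_t
#include <cstdlib> // std::malloc

template<typename T>
Node<T> *Node<T>::create(int n)
{
  // One allocation holds the whole node plus (n - 1) extra elements of T
  // that live behind array[1].
  std::size_t bytes = sizeof(Node<T>) + (n > 1 ? (n - 1) * sizeof(T) : 0);
  Node<T> *node = static_cast<Node<T> *>(std::malloc(bytes));
  node->parent = nullptr;
  node->child = nullptr;
  node->size = n;
  return node; // T is POD, so no element constructors need to run
}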

The simplest way to implement operator[] would be this.

template<typename T>
T &Node<T>::operator[](int n)
{
  return array[n];
}

It should work fine in most sane C++ implementations. But as the C++ standard allows insane implementations that do array bounds checking, as far as I know this is technically invoking undefined behaviour. Then can I do this?

template<typename T>
T &Node<T>::operator[](int n)
{
  return (&array[0])[n];
}

I'm a little confused here. The [] operator for primitive types is just syntactic sugar for *. Thus (&array[0])[n] is equivalent to (&*(array + 0))[n], which I think can be cleaned up as array[n], making it the same as the first version. Okay, but I can still do this.

template<typename T>
T &Node<T>::operator[](int n)
{
  return *(reinterpret_cast<T *>(reinterpret_cast<char *>(this) + offsetof(Node<T>, array)) + n);
}

I hope I'm now free from any possible undefined behaviour. Perhaps inline assembly would show my intent better. But do I really have to go this far? Can someone clarify things for me?

By the way, T is always a POD type. The whole Node is also POD.

  • Maybe you should look into various forms of sparse arrays. http://stackoverflow.com/a/21312608/193892 There are many implementations and libraries for C and C++ https://stackoverflow.com/questions/4306/what-is-the-best-way-to-create-a-sparse-array-in-c – Prof. Falken Sep 03 '15 at 06:54
  • @xiver77 Could you elaborate on the problems you expect when using std::vector? What about using a custom allocator? In the end, this may be just a case of space vs. time complexity optimization. – Baiz Sep 03 '15 at 07:43
  • @Baiz Quoting myself, "I could use std::unique_ptr or std::vector instead, but the problem is that then there is double dynamic allocation, double indirection, and extra cache misses for each tree node. In my last test, this made my program take about 50% more time, which is simply not acceptable for my application." –  Sep 03 '15 at 07:55
  • @Baiz using std::vector is both space and time inefficient by the way. Custom allocator does not help. If it can, please show me how. –  Sep 03 '15 at 07:57
  • Just use your first implementation and check the asm to ensure it's doing what you expect. Also since you've got Gb's of data you should try to allocate linked nodes in adjacent memory so that walking the nodes doesn't cause an unnecessary CPU cache reload. – Andy Brown Sep 03 '15 at 08:07
  • So, basically, you are worried that your access with `array[n]` in `operator []` is UB, and want to solve that? – Mats Petersson Sep 03 '15 at 08:08
  • @AndyBrown Sounds like a good solution actually. About your suggested optimization, the problem is that the actual tree structure I'm using has many children per node, and frequently selects one of the top node's children, making that the top node and deallocating all unreferenced nodes that were the recursive children of the unselected nodes. –  Sep 03 '15 at 08:26
  • @AndyBrown But now I realize that maybe looking at the asm isn't that simple. I actually tried right now and it was impossible to find the relevant part in the fully optimized function-inlined assembly. –  Sep 03 '15 at 08:30
  • @xiver77 I'm aware of what you have written, but some of those things could be assumptions on your side, not facts. As you haven't provided your implementation, we'll have to assume as well. 1. `std::vector` is not both space and time inefficient, especially not if you're smart about preallocation. It also totally depends on how often and how many elements are added. 2. How often are nodes changed and how often are values of type T accessed? – Baiz Sep 03 '15 at 08:37
  • 3. With "double dynamic allocation", do you mean the allocation for the Node object and the subsequent allocation for the vector? I doubt that this will cost you +50% time. 4. With "double indirection", do you mean that accessing an element of type T will go via your class interface and over that of the vector? If the size of the vector will not change then this won't be a problem, as you could simply store a pointer to the vector's memory and use offsets in the getter method. – Baiz Sep 03 '15 at 08:38
  • 5. Am I correct in the assumption that when creating a node, you know how many elements of type T it will have to hold? Otherwise, the intent of using a flexible array member escapes me. 6. It's probably more interesting how a struct's array is filled with values. – Baiz Sep 03 '15 at 08:38
  • I'm sure I read somewhere about a proposal to add full C++ support for variable length arrays, including full support for calling all destructors correctly and so on. But I can't find it now. Anybody have a link? – Aaron McDaid Sep 03 '15 at 08:45
  • @Baiz Oh, I understand. Every node is fixed in size after allocation. No reallocation for the nodes. –  Sep 03 '15 at 08:45

3 Answers

2

First of all, an implementation is free to reorder class members in all but trivial cases. Your case is not trivial because it has access specifiers. Unless you make your class POD, or rather standard-layout as it's called in C++11, you are not guaranteed that your array is actually laid out last.

Then of course flexible array members do not exist in C++.

All is not lost however. Allocate a chunk of memory large enough to house both your class and your array, then placement-new your class at the beginning, and interpret the portion that comes after the object (plus any padding to ensure proper alignment) as the array.

If you have this, then the array can be accessed with

reinterpret_cast<T*>(
 reinterpret_cast<char*>(this) +
 sizeof(*this) + padding)

where padding is chosen such that alignof(T) divides sizeof(*this) + padding, i.e. the array is suitably aligned for T.

For inspiration, look at std::make_shared. It also packs two objects into one allocated block of memory.
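
A minimal sketch of that approach, with a stripped-down all-public Node and a hypothetical make_node helper (the names, the layout and the missing error handling are just for illustration, not a fixed design):

#include <cstddef>  // std::size_t
#include <cstdlib>  // std::malloc, std::free
#include <new>      // placement new

template<typename T>
struct Node
{
  Node *parent;
  Node *child;
  int size;

  // Offset of the array: sizeof(Node) rounded up to a multiple of alignof(T).
  static std::size_t array_offset()
  {
    return (sizeof(Node) + alignof(T) - 1) / alignof(T) * alignof(T);
  }

  T *array()
  {
    return reinterpret_cast<T *>(reinterpret_cast<char *>(this) + array_offset());
  }

  T &operator[](int n) { return array()[n]; }
};

// One allocation houses the Node, the padding and the n elements of T.
// T is assumed POD, so the elements are left uninitialized here.
template<typename T>
Node<T> *make_node(int n)
{
  void *raw = std::malloc(Node<T>::array_offset() + n * sizeof(T));
  return new (raw) Node<T>{nullptr, nullptr, n};
}

Freeing is then a single std::free on the node pointer; no destructors need to run for POD T.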

n. m. could be an AI
  • One could add that one could change `T array[1]` to `T *const array` and initialize it to the space after the object (in the constructor). But one could prefer to have a method that returns that pointer instead (and avoid wasting precious memory for storing the pointer). – skyking Sep 03 '15 at 09:39
  • Are you sure that my `Node` is not a POD? –  Sep 03 '15 at 10:47
  • Where does your argument come from that implementations can freely reorder class data members? If the `array` member is not guaranteed to be laid out last, that's certainly breaking a huge part of C compatibility. –  Sep 03 '15 at 10:50
  • Okay, I see here http://en.cppreference.com/w/cpp/language/data_members#Standard_layout that all non-static data members should have the same access control. Maybe that's why you mentioned access specifiers. I'm really confused here. Could you explain the problem in more detail? –  Sep 03 '15 at 10:56
  • Now I see that the C++ standard guarantees member order in memory only among members with the same access control. Kind of makes me hate C++ even more, but still good to know. I get your solution, but I'd also like an explanation for when all members are public, so that `array` does indeed come last. –  Sep 03 '15 at 11:06
  • Right, to be C-compatible all non-static data members of a class should have the same access control. Why is this a problem? – n. m. could be an AI Sep 03 '15 at 11:10
  • @n.m. It is a problem because it's not so obvious that adding a single private or public between the data members can mess up their order in memory. –  Sep 03 '15 at 11:11
  • But once you know it, it's not a problem any more. – n. m. could be an AI Sep 03 '15 at 11:16
  • If all members are public, there are still no flexible array members in C++. – n. m. could be an AI Sep 03 '15 at 11:20
  • Yes, there are no flexible array members in C++. But using an array of length 1 as the last member is a common pattern in C even nowadays, when it does have flexible array members. For example, Windows API code does this, and such code needs to go through the VC++ compiler. MSalters, in a comment on his answer, also emphasized the point that C++ had to keep compatibility with existing C code. Apart from the actual widely-used C++ implementations, I was wondering how the C++ standard cares about this. Perhaps I'm thinking too much about this at this point. –  Sep 03 '15 at 12:01
  • Well you should ask yourself why flexible arrays were added to C. Rather than legalizing existing practice, the standard went to the trouble of creating something entirely new. I suppose it's because the practice, however widespread, violates the standard so hard that legalizing it wasn't worth the effort. But you decide for yourself. – n. m. could be an AI Sep 03 '15 at 12:19
  • In my opinion the C and C++ standard committees should have given up on the possibility of array bounds checking altogether and added something like a 0-sized array. Array bounds checking has no use in optimized executables, and during development, debugging tools can already do this nicely. Anyway, I'm accepting your answer. –  Sep 03 '15 at 12:47
  • The implementation is required to allocate class members at increasing addresses except when crossing an access specifier. That's not "all but trivial cases." – Potatoswatter Sep 04 '15 at 04:32
1

The main problem with "out of bounds" array access is that no object lives there. It's not the out of bounds index itself which causes the problem. Now in your case there presumably is raw memory at the intended location. That means you can in fact create a POD object there via assignment. Any subsequent read access will find the object there.

The root cause is that C didn't really have array bounds. a[n] is just *(a+n), by definition. So the first two proposed forms are already identical.

I'd be slightly more worried about any padding behind T array[1], which you'd be accessing as part of array[1].

MSalters
  • In simpler terms, do you mean there is a problem around destructors being called? That the destructor for `array[0]` will be called, but not the later elements? – Aaron McDaid Sep 03 '15 at 08:41
  • @AaronMcDaid: No, because a class with a (non-trivial) destructor wouldn't be a POD. The code above would indeed fail horribly for `T==std::string`, probably already at initialization. – MSalters Sep 03 '15 at 08:44
  • Ah yes, but with careful use of placement-new, it could be constructed correctly? – Aaron McDaid Sep 03 '15 at 08:55
  • Are you sure that out-of-bounds array access is okay for PODs when allocated memory is there? Quoting the standard, `5.7.4 ... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.` –  Sep 03 '15 at 08:58
  • @xiver77: Yes. There's tons of existing legal C code which called `malloc` and then used that raw memory to store objects. The "POD" concept in C++98 was introduced to keep such existing C code legal in C++. – MSalters Sep 03 '15 at 09:13
  • @AaronMcDaid: Indeed, that's how you'd create a normal (non-POD) object in raw memory. That such raw memory happens to lie right behind another object is a trivial detail. – MSalters Sep 03 '15 at 09:15
0

You also wondered if there was an alternative approach. Given your recent comment about "no reallocation", I'd store the array data as a pointer to heap-allocated storage, but:

Trees have predictable access patterns, from root to child. Therefore, I'd have a Node::operator new and make sure that child nodes are allocated directly after their parent. This gives you locality of reference when walking the tree. Secondly, I'd have another allocator for the array data, and make it return contiguous memory for the parent's array and that of its first child (followed by its first grandchild, of course).

The result is that the node and its array have no locality of reference between them, but instead you get locality of reference both for the tree graph and the associated array data.

It's quite possible that the array data allocator can be a trivial pool allocator for the tree. Just allocate 256 KB chunks at a time and parcel them out a few ints at a time. The whole state you need to track is how much of the current 256 KB chunk you've already handed out. This is much faster than anything std::vector<T, std::allocator<T>> can achieve, because the default allocator cannot know that the memory lives for as long as the tree lives.
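
Such a pool can be as small as this (a rough sketch; the chunk size, the max_align_t-only alignment and the absence of per-array freeing are deliberate simplifications, and a real version would also disable copying):

#include <cstddef>  // std::size_t, std::max_align_t
#include <vector>

// Parcels out raw storage from 256 KB chunks; memory is only released when
// the pool (and hence the tree) is destroyed. Assumes each request fits in
// one chunk.
class ArrayPool
{
public:
  void *allocate(std::size_t bytes)
  {
    // Keep every returned pointer aligned for any fundamental type.
    std::size_t a = alignof(std::max_align_t);
    bytes = (bytes + a - 1) / a * a;
    if (chunks_.empty() || used_ + bytes > kChunkSize) {
      chunks_.push_back(new char[kChunkSize]);
      used_ = 0;
    }
    void *p = chunks_.back() + used_;
    used_ += bytes;
    return p;
  }

  ~ArrayPool()
  {
    for (char *c : chunks_)
      delete[] c;
  }

private:
  static const std::size_t kChunkSize = 256 * 1024;
  std::vector<char *> chunks_;
  std::size_t used_ = 0;
};

Node::operator new can draw from a second pool of the same shape, so that parent and child nodes end up next to each other as well.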

MSalters
  • Although you didn't answer my question then, your hint about the tree residing in contiguous memory was indeed correct. I can model the whole tree as a contiguous array, and now it's also possible to apply very interesting compile-time optimizations via templates, as things are more deterministic than when dynamically allocating each node. Thank you. –  Sep 08 '15 at 05:20