6

Does any C++ library implements something like Haskell Data.Sequence container?

I'm mostly interested in:

  1. Maintaining the elements order (In which they were inserted).
  2. O(logn) access by an index. Aka operator[](size_type pos).
  3. O(logn) insertion/deletion in the middle (by an index).
willir
  • 599
  • 1
  • 4
  • 15
  • A deque might be good enough, though it doesn't fully meet the third requirement. – chris Jan 07 '18 at 17:57
  • @chris It doesn't meet the second one at all –  Jan 07 '18 at 17:58
  • Possibly a set where the set element is a type that maintains both a value and an insertion ordering. –  Jan 07 '18 at 17:59
  • @Angnew Access to a deque is only O(1) if you know in advance where to look. Otherwise it is O(N). –  Jan 07 '18 at 17:59
  • @NeilButterworth, Fair, I didn't think that through when looking at cppreference. Amortized constant might also be good enough, or it might not. – chris Jan 07 '18 at 18:08
  • Are you looking for something immutable (as in `Data.Sequence`) or something mutable? – dfeuer Jan 07 '18 at 18:08
  • Well, yeah, `std::deque` doesn't meet the third requirement, though it meets others. It has `O(1)` access by an index, it is even better than `O(logn)`. I'm not interested in searching element, just getting it by an index. However `std::deque` has `O(n)` insert in the middle. – willir Jan 07 '18 at 18:09
  • You can implement binary search tree that meets all requirements, that is roughly speaking how `Data.Sequence` is implemented (It is actually Finger Tree, as far as I know). Rope structure has similar properties, and you can find Rope for `C++`. I just wonder, is there anything like Generic Rope (for any element type). – willir Jan 07 '18 at 18:09
  • @dfeuer mutable. – willir Jan 07 '18 at 18:10
  • How about a `std::map` where you map the insertion order (an int that you need to keep track of) to each datum? A`std::map` is a tree, so meets all requirements. – Cris Luengo Jan 08 '18 at 00:49
  • It would work only if I know all indexes of elements beforehand. But it is rarely the case, otherwise, I wouldn't need to insert into the middle, I would just sort by index and use `std::vector` or so. I mean in `Data.Sequence` if you have two elements, their indexes: [0, 1]. If you insert another element into the middle, you change the index of the last element to 2. `std::map` doesn't allow that in O(log(N)) – willir Jan 08 '18 at 01:17

1 Answers1

4

It seems to me that the way to implement* such a data structure you'd need a tree that stores the number of elements in each node. It allows insertion and retrieval in O(log(N)), and indices are maintained simply by counting how many elements there are to the "left" of a given node in the tree.

* I'm answering maybe a slightly different question here, the actual question asks for a recommendation for a library, which is explicitly off-topic at SO.

A node of this tree would look like this:

template<typename T>
struct Node {
  Node* left;
  Node* right;
  size_t elements;
  T value
  
  T& access(size_t index) {
    if (left->elements == index) {
      return value;
    } else if (left->elements > index) {
      return left->access(index);
    } else {
      return right->access(index - left->elements - 1);
    }
  }

  void insert(size_t index, T&& value) {
    // insert `value` at right place, increment `elements`
  }
}

(I left the insert method as an exercise to the reader.)

Edit: As willir (OP) mentioned in the comments below, the tree would need to be a balanced tree. Arne Vogel suggests that B-trees would be the best choice for cache locality.


But:

If you do implement something like this, make sure that you measure your application, compare it to std::vector. Inserting into a vector at an arbitrary location is O(N), not O(log(N)), but it's O(N) of a very cheap operation. A vector has many advantages over such a data structure:

  1. No code to maintain.

  2. Less stuff to store (in a tree you need to store two pointers and a count, which are not necessary in a vector), meaning more of your data fits in the cache at the same time.

  3. Data is always accessed in the same order in which it is stored. With trees you need to traverse nodes that can be stored in arbitrary locations in memory, they don't need to be close together, and are likely not stored in the order in which they're read.

Points 2 and 3 mean that vectors have many fewer cache misses. This can make for a huge difference in timings.

Moving data around in a vector becomes rather slow if each data element is large. But in this case you should keep pointers to your data in the vector, so that you're moving lists of pointers around, rather than your actual data. For such large data elements, I would suggest allocating each one independently, and keeping its pointer in a std::vector<std::unique_ptr<T>>.

Here are some relevant links:

Community
  • 1
  • 1
Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
  • 1
    I wrote this data structure a long time ago in C++, as a binary tree just like yours. Don't think I'll find the source in reasonable time. But I didn't know anything about cache locality or allocation strategies at the time. If I were to do it again, and it needed to be fast, I'd probably use an approach based on B-trees for better cache locality, and (at least) standard allocator support. – Arne Vogel Jan 08 '18 at 07:42
  • 2
    Yeah, also, just a binary tree is not enough. You need to keep it balanced. So Red-Black, Splay, or B-Tree. (I've implemented such thing via Splay Tree before, not for real project though). As was mentioned by @ArneVogel, B-Tree would be better because it should be more cache friendly. standard allocator support is another thing to remember. That is why I asked about a library - there are many things to be aware of if you want to implement it by yourself. – willir Jan 08 '18 at 14:18
  • 1
    My current question wasn't caused by some real project needs (Should've mentioned it in the question). I wondered "just in case" - Haskell is the only language I know that has such functionality. So it seems the answer is - no. C++ doesn't have such library. Maybe nobody just needs it. `std::vector` or `std::deque` work fast enough. Anyway, thanks for your response, and benchmark references. I think I'll mark your question as accepted (after waiting a bit to be sure nobody comes up with such library). Could you please add the info from my previous comment? – willir Jan 08 '18 at 14:30