
I want a structure that can:

  1. add a new element fast (an element is always a pointer, never the data itself)
  2. remove an existing element fast, using only the pointer itself as identity
    (given B* b1, B* b2, test b1==b2 directly).
    Removals tend to occur in random order (all over the place).
    The user calls remove(T*).
  3. iterate fast

The data structure that seems to suit these criteria is a hash set.
However, a hash set has these disadvantages:

  1. Adding an element requires computing a hash function, which is slow.
  2. Adding an element also needs some kind of bucket management, which is even slower.
  3. Removing an element suffers from the same costs as adding.
  4. Iterating pays the bucket-access cost (O(size of bucket)) and has to walk past empty buckets.

I tested it (n = 500 to 1000); it is much slower than a plain array.

Question

What data structure can support these requirements?

My poor solution

The only solution I found is to store the index directly inside the element.
It is fast and meets all the criteria, but it seems to be a hack.

With this hack, there is an additional strange limitation:
a given element can be contained in only one Datastructure.

If I want more, the hack becomes more complicated:

class B{
    int indexSlot1_;
    int indexSlot2_;    // ... provide additional slots if required ...
public:
    static int& indexSlot1(B* b){ return b->indexSlot1_; }
    static int& indexSlot2(B* b){ return b->indexSlot2_; }
};
Datastructure<B*,&B::indexSlot1> database1;
Datastructure<B*,&B::indexSlot2> database2;
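
For illustration, here is a minimal sketch of one way such a Datastructure can be implemented (a dense vector with swap-and-pop removal; the details are illustrative and may differ from my real code, but the template signature matches the declarations above):

#include <vector>

// Dense vector of pointers. Each element remembers its own position through
// the Slot accessor, so add, remove(T*) and iteration are all cheap.
template<class TPtr, int& (*Slot)(TPtr)>
class Datastructure{
    std::vector<TPtr> items_;
public:
    void add(TPtr t){
        Slot(t) = static_cast<int>(items_.size());   // element remembers its slot
        items_.push_back(t);
    }
    void remove(TPtr t){
        int i = Slot(t);
        items_[i] = items_.back();    // move the last element into the hole
        Slot(items_[i]) = i;          // patch the moved element's stored index
        items_.pop_back();
    }
    typename std::vector<TPtr>::iterator begin(){ return items_.begin(); }
    typename std::vector<TPtr>::iterator end()  { return items_.end(); }
};

Add and remove are O(1) with no hashing, and iteration is a contiguous scan, but the limitation stands: an element needs one dedicated index slot per Datastructure it belongs to.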
javaLover
  • `I tested it, it is much slower than array.` You tested it on what data size? – SingerOfTheFall Jan 10 '17 at 07:09
  • Probably a vector, sorted by elements cast to `uintptr_t`, would do. – Joker_vD Jan 10 '17 at 07:09
  • @SingerOfTheFall I believe 500 to 1000. I profiled it; the report said that bucket access eats a huge % of CPU, while array access costs virtually nothing. – javaLover Jan 10 '17 at 07:11
  • @Joker_vD That suffers an insertion cost of O(log N), right? – javaLover Jan 10 '17 at 07:12
  • @javaLover, test it on a million, you might change your mind. Of course you might never have that many elements in your real application, but then if you say "I want it to be fast" you should first define what "fast" means to you. – SingerOfTheFall Jan 10 '17 at 07:12
  • Iterating quickly - you can't get better than a compact structure like an std::vector. The cache impact is huge. Question - do you care if the data is sorted or not? – Ivan Rubinson Jan 10 '17 at 07:12
  • @Ivan Rubinson std::vector is good enough for iterating. I don't care about sorting. – javaLover Jan 10 '17 at 07:13
  • Where do you add/remove elements? Always at either of the ends? Or all over the place? – Ivan Rubinson Jan 10 '17 at 07:14
  • @Ivan Rubinson All over the place; I will edit my question. Thanks – javaLover Jan 10 '17 at 07:14
  • Do you care about memory usage? – Ivan Rubinson Jan 10 '17 at 07:17
  • @SingerOfTheFall Good point about "million". However, I will use it only for 500-1000. I have to add/remove/iterate very frequently. – javaLover Jan 10 '17 at 07:17
  • @Ivan Rubinson I care somewhat; it is not the first priority though. It should not be much worse than a hash set, but this criterion is relaxed. – javaLover Jan 10 '17 at 07:19
  • Sounds a lot like what Stroustrup talks about in [Why you should avoid linked lists (and use vector instead)](https://youtu.be/YQs6IC-vgmo) – Ivan Rubinson Jan 10 '17 at 07:19
  • Difficult to get all three it is. Trade-offs you must make. Choose which you need more, you must – user4581301 Jan 10 '17 at 07:19
  • If memory is not an issue, you can make a combination of a vector and a doubly-linked list (for example you can store the data as a list, but also have a vector of pointers to each data element, or something else, depending on the exact operations that you perform on the data). This way you can iterate by using the list, and access elements by key by using the vector – SingerOfTheFall Jan 10 '17 at 07:20
  • @user4581301 My hacky data structure gets all three perfectly ... heehee – javaLover Jan 10 '17 at 07:20
  • Personally, I'd make a vector of arrays (think: chunks) and keep track of how much of the array I use. Adding an element won't require allocating memory (unless you run out every once in a while), removing doesn't involve a deallocation, and iterating is fast. – Ivan Rubinson Jan 10 '17 at 07:23
  • @Ivan Rubinson Removal costs O(n) to search for the index – javaLover Jan 10 '17 at 07:24
  • @SingerOfTheFall Is it a pool? Do all T have to be born from a shared pool? It is an interesting idea; I still don't understand it well enough. – javaLover Jan 10 '17 at 07:26
  • @javaLover you are right. I can keep things sorted internally to improve the complexity. However, this means pointers to data get invalidated with each addition / removal. I'd have to make an iterator class. – Ivan Rubinson Jan 10 '17 at 07:26
  • You are looking for a data structure that does *everything* better than a hash table. Not very likely to happen. If such a beast existed, everyone would prefer it over the hash table. – n. m. could be an AI Jan 10 '17 at 07:27
  • @n.m. Disagree. A hash table provides value comparison; I just want unique raw-pointer comparison. If that sentence actually has no meaning in terms of design, then yes, you are correct. – javaLover Jan 10 '17 at 07:28
  • @Ivan Rubinson that pretty much sounds like std::vector or list – Swift - Friday Pie Jan 10 '17 at 07:30
  • @javaLover Your values *are* pointers. – n. m. could be an AI Jan 10 '17 at 07:30
  • @n.m. Hmm, that poses a problem in that the value of a pointer is platform/implementation dependent. – Swift - Friday Pie Jan 10 '17 at 07:32
  • @n.m. Understood, but at least I tried to think it is not. I hope there is some way to exploit this special *value* (a pointer). If you are correct, there is simply no way, so I don't believe you. I am an optimistic person. :) – javaLover Jan 10 '17 at 07:33
  • You wanted to remove elements "using only the pointer itself as identity", in other words, using the value of the pointer. That's implementation dependent, yes. – n. m. could be an AI Jan 10 '17 at 07:33
  • "there are simply no way, so I don't believe you". You are welcome to find a way and get rich. I know I would if I could. – n. m. could be an AI Jan 10 '17 at 07:35
  • When you know exactly the data or have a predictable pattern to exploit, you can get miraculous performance. But you can also forget about reusing the scheme for another task with different data and access patterns. You may have found a structure that works brilliantly for a set of circumstances. Awesome. Use it in those circumstances. Profile the smurf out of it before deploying it outside those circumstances. – user4581301 Jan 10 '17 at 07:46
  • If your standard hash table implementation is too slow, it might be time to roll your own: Creating a solid hash for a pointer is fast (just a few bit shifts and xor operations; see the first sketch after these comments), and bin management can boil down to linked lists with only a single element on average. Most importantly, it's possible to arrange the data in such a way that it is stored sequentially in an array, so iterating is as fast as it can get. – cmaster - reinstate monica Jan 10 '17 at 07:53
  • @cmaster Thanks for the bit-shift idea. I dislike the bucket misses. It is impractical for me to store the data sequentially in an array. The farthest I can reach practically seems to be using many pools of exponential size joined together (1:2:4:8...), e.g. `[A][Ax][xxAA][AAxxAxAx]` (A = data, x = null), and the `Datastructure` is only a subset of the pool. – javaLover Jan 10 '17 at 08:03
  • Why is it a problem to store the data sequentially? I thought your data is only pointers anyway? Of course, you need two more arrays of the same size, one to hold the hash index, and one to hold the linked-list pointers (the latter can be merged into a struct with the data pointers), but this indirection enables you to control the layout of the data within the array, allowing for really fast iteration. The overheads are constant per entry and small. – cmaster - reinstate monica Jan 10 '17 at 08:09
  • @cmaster The problem with storing sequentially is that I don't know the maximum amount, and I often cache some pointers to the elements too. Does your solution have the same limitation as the hack in "my poor solution" (`T` can be a member of at most 1 `DataStructure`)? An additional slot (an additional array of the same size) has to be created for more than 1 concurrent `DataStructure`. – javaLover Jan 10 '17 at 08:15
  • What is the average lifetime of your pointer from insertion to deletion? How is it distributed? – n. m. could be an AI Jan 10 '17 at 08:51
  • @n.m. It varies; I have never measured it. They are mostly game objects. I think they can be roughly divided into long-lived (> 5 seconds) and short-lived, but I am not sure. My best guess is that the distribution is a bell curve (normal). – javaLover Jan 10 '17 at 08:52
  • This is good. You may want to have separate data structures for long-lived and short-lived data. Consider a simple array paired with a hash table (see the second sketch after these comments). Long-lived entries are at the beginning of the array and have a hash entry, short-lived ones are at the end of the array and have no entry in the hash table. To insert, you just push back; iteration is obvious; to delete, scan the short-lived end of the array, then use the hash. Try different strategies of when to move an element from the short-lived pile to the long-lived. – n. m. could be an AI Jan 10 '17 at 09:02
  • @n.m. Thanks as always, n.m. You forced me to think in a more productive way. By the way, I was not clear, sorry: a certain data structure tends to contain only short-lived or only long-lived elements. From your advice, I probably ought to think of it as 2 problems (use a different data structure for each one). – javaLover Jan 10 '17 at 09:10
  • Well, my standard solution to the unknown max size problem is to just grow in factors of two. Of course, that invalidates all pointers into the hash slots, but they should not be known to code outside the hash table anyway. However, this is likely irrelevant anyway, since @n.m. has made quite a good proposal already (probably better than mine). As to the problem of concurrent access: Any locking mechanism you use will dwarf the overhead of actually managing the data anyway. So, if that's your main problem, you will need to find ways to reduce inter-thread communication. – cmaster - reinstate monica Jan 10 '17 at 10:02
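
The comments suggest two concrete directions. Both sketches below are hedged illustrations of those suggestions, not code from the discussion.

First, the kind of cheap pointer hash cmaster describes; the exact mixing is my own guess (a few shifts and xors that throw away alignment bits):

#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Cheap hash for pointer keys: drop the alignment bits, fold the high bits down.
struct PtrHash{
    std::size_t operator()(const void* p) const{
        std::uintptr_t x = reinterpret_cast<std::uintptr_t>(p);
        x >>= 4;           // low bits carry little information for aligned objects
        x ^= x >> 16;      // mix high bits into the low bits
        return static_cast<std::size_t>(x);
    }
};

// hypothetical usage: std::unordered_set<B*, PtrHash> live;

Second, n. m.'s array-plus-hash split for short-lived versus long-lived entries. Class and member names are mine, and the promotion policy is left to the caller:

#include <cstddef>
#include <unordered_map>
#include <vector>

// Short-lived entries sit at the tail of a dense vector and are removed by a
// short backwards scan; entries promoted to long-lived get a pointer->index
// entry in a hash map, so removing an old entry never scans the whole array.
template<class T>
class SplitSet{
    std::vector<T*> items_;                        // [0, longCount_) long-lived, rest short-lived
    std::unordered_map<T*, std::size_t> index_;    // positions of long-lived entries only
    std::size_t longCount_ = 0;
public:
    void add(T* t){ items_.push_back(t); }         // new entries start as short-lived

    void promoteOne(){                             // call whenever the policy says so
        if (longCount_ < items_.size())
            index_[items_[longCount_]] = longCount_++;
    }

    void remove(T* t){
        // 1) scan the short-lived tail (expected to stay small)
        for (std::size_t i = items_.size(); i-- > longCount_; ){
            if (items_[i] == t){
                items_[i] = items_.back();         // swap-and-pop inside the tail
                items_.pop_back();
                return;
            }
        }
        // 2) long-lived: one hash lookup
        auto it = index_.find(t);
        if (it == index_.end()) return;            // not stored
        std::size_t i = it->second;
        items_[i] = items_[longCount_ - 1];        // fill the hole with the last long-lived entry
        index_[items_[i]] = i;                     // update that entry's position
        index_.erase(t);                           // t is gone (also handles the self-move case)
        --longCount_;
        items_[longCount_] = items_.back();        // fill the new hole with the last short-lived entry
        items_.pop_back();
    }

    typename std::vector<T*>::iterator begin(){ return items_.begin(); }
    typename std::vector<T*>::iterator end()  { return items_.end(); }
};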

0 Answers