
I'm looking for an efficient data structure in C# that allows me to keep a list of items ordered (by the user) without duplicates.

What I mean by "ordered by the user", e.g.:

  • Insert element 1.
  • Insert element 2 before element 1.
  • Insert element 3 between 1 and 2. Then rearrange at will.

I will need the order to be continuously updated in a database upon change so that I can load it at start.

Operations I need:

  1. Insert at a given index
  2. Delete at a given index
  3. Move from index x to index y (could be expressed as the combination of 2 and 1 if there is no loss in performance)

All of these operations will be frequent and equally important.

dR_
  • Try the SortedDictionary https://msdn.microsoft.com/en-us/library/f7fta44c(v=vs.110).aspx SortedSet if you don't need KV pairs: https://msdn.microsoft.com/en-us/library/dd412070(v=vs.110).aspx – edtheprogrammerguy Apr 24 '17 at 16:48
  • So you want it ordered, but you want to be able to rearrange it? What value is it being ordered by? Is that value changing when you rearrange it? – MrZander Apr 24 '17 at 16:57
  • I'm sorry I wasn't explicit enough. Ordered by the user, i.e.: Insert element 1. Insert element 2 before element 1. Insert element 3 between 1 and 2. Then rearrange at will. Thank you – dR_ Apr 24 '17 at 16:59
  • Unless you have a column in the database to record the user's order, this is all academic. – Berin Loritsch Apr 24 '17 at 17:09
  • I am not so sure SortedSet or SortedDic do what you need as the key is sorted. It does not enforce uniqueness on the values. – paparazzo Apr 24 '17 at 17:16
  • How many items are you expecting to exist in this data structure? If the user is manually inserting items, it seems unlikely that you'll be getting 10,000+ items in it, am I right? Unless you hit a really large number of items, there is unlikely to be a noticeable difference between `O(1)` and `O(n)` operations. – StriplingWarrior Apr 24 '17 at 17:28
  • There is a huge difference between doing this as an in-memory data structure and doing it as data stored in a database with live/interactive changes. Which is it? And what kind of database? – RBarryYoung Apr 24 '17 at 17:48
  • Great questions. Having a column in the database of course is mandatory. I haven't decided on the size of the structure yet. It can be about 100 items max if I decide to pass old items to history, or it can keep adding up to several thousands if I don't, and in this case users will mostly reorder the newer items. I'm not solving the problem just with an in-memory structure nor just with a database; it's both. Changes made to the data structure should be reflected in the MSSQL database. – dR_ Apr 24 '17 at 19:09
  • Just a suggestion for the database side of things. You could use a double for the "order" column. When you move an item in memory calculate its new order as the average of its new neighbors' order. Then you only have to update one database record since you sort them on load. (If an item gets too close, of course you should re-space them out, but that would take quite a few manual moves before that would become necessary.) – Mike Apr 24 '17 at 19:24
  • Are you ever planning to create relationships between the items on your list and other tables in your database? The reason I ask is that you could simplify things quite a bit using a document-oriented approach. SQL Server supports XML (and now even JSON) data types. If you're only expecting the list to grow to hundreds of items, you could easily save the entire list off in a single round-trip every time any item got reordered. – StriplingWarrior Apr 24 '17 at 20:17
  • If you have a small number of items then my solution should work fine. – paparazzo Apr 24 '17 at 20:54
  • @paparazzo, uniqueness in `SortedSet` can be easily achieved by throwing an exception from the comparer when it gets two equal values. – Victor Yarema Mar 09 '20 at 18:20

4 Answers


I assume by "efficient" you mean asymptotically efficient. If that's not the case, then clarify the question.

The combination of indexing and arbitrary insertion is a tricky one.

  • List<T>s -- which are just a thin wrapper over arrays -- have O(1) insertion/deletion at the end, O(n) insertion/deletion at the beginning, and O(1) indexing. Checking uniqueness is O(n).
  • Linked lists have O(1) insertion/deletion provided you already know where you want to put the item, but O(n) indexing to find that location. Checking uniqueness is O(n).
  • Balanced binary trees have O(lg n) insertion and deletion and indexing if you're clever. Checking uniqueness is O(n). More exotic data structures like finger trees, skiplists, etc, are similar.
  • Hash sets have O(1) insertion and deletion but no indexing; checking uniqueness is O(1).

There is no single data structure that fits your needs. My advice is:

  1. Embrace immutability. Write an immutable data structure that meets your needs. It will be easier to reason about.
  2. Write a combination of a balanced binary tree -- red-black, AVL, etc -- and a hash set. The hash set is used only for uniqueness checking. The BBT has the number of items below it in every node; this facilitates indexing. The insertion and deletion algorithms are as normal for your BBT, except that they also rewrite the spine of the tree to ensure that the item count is updated correctly.

This will give you O(1) uniqueness checking and O(lg n) indexing, inserting and deleting.

I note that this data structure gives you O(1) answers to the question "is this item in the collection?" but O(n) answers to the question "where is it?" so if you need the inverse indexing operation to be fast, you have a much larger problem on your hands.
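
To make the indexing part concrete, here is a minimal sketch, assuming a size-augmented node (balancing, insertion and deletion are omitted, and all names are invented), of how caching subtree counts turns "find the item at index i" into a single O(lg n) walk:

    sealed class Node<T>
    {
        public T Item;
        public Node<T> Left, Right;
        public int Count = 1; // items in this subtree, including this node

        static int SizeOf(Node<T> n) => n == null ? 0 : n.Count;

        // The left subtree holds the first SizeOf(Left) items, this node
        // holds the next one, and everything else is on the right.
        public T ItemAt(int index)
        {
            int leftCount = SizeOf(Left);
            if (index < leftCount) return Left.ItemAt(index);
            if (index == leftCount) return Item;
            return Right.ItemAt(index - leftCount - 1);
        }
    }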

Eric Lippert

I think I would just use a List<T> and take the O(n) Contains, or keep a separate HashSet for uniqueness. List does all the other operations nicely; "nicely" in the sense that the operations are all there, though most will be O(n). Even at 10,000 items, O(n) is pretty fast. The database calls are going to be the slowest part by far (try async).

    using System;
    using System.Collections.Generic;

    class MyCollection<T> : IList<T>
    {
        private readonly IList<T> _list = new List<T>();

        public void Insert(int index, T item)
        {
            // Reject duplicates up front; Contains is an O(n) scan.
            if (_list.Contains(item))
                throw new ArgumentException("Item is already in the collection.");
            _list.Insert(index, item);
            //make database call
        }

        // implement all the other members of IList<T> with database calls
    }
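
A Move could then, as a sketch inside the same class, be a delete plus an insert on the in-memory list with only a single database call (note that toIndex is interpreted against the list after the removal):

        public void Move(int fromIndex, int toIndex)
        {
            // Delete + insert on the in-memory list...
            T item = _list[fromIndex];
            _list.RemoveAt(fromIndex);
            _list.Insert(toIndex, item);
            // ...but only one database call
        }
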
paparazzo
  • _The database calls are going to be the slowest part by far_ Probably so much so that all other considerations are moot. – TaW Apr 24 '17 at 18:52
  • @TaW That is why I just went with a simple IList – paparazzo Apr 24 '17 at 18:54
  • So you suggest that moving an item would be a delete + insert? How does this translate to the database side? Would the items have an integer and would those be updated in a similar fashion as what @StriplingWarrior suggested? If so, I'd be using multiple O(n) operations just for 1 task. – dR_ Apr 24 '17 at 22:57
  • You would have a Move method that under the covers is a Delete and an Insert on the _list, and it should be a single db call. Yes, you could have two O(n) operations. The database call is still going to dominate. If you think you have a better way then use it. – paparazzo Apr 24 '17 at 23:08

This has kind of turned into two questions: one for the database layer and one for the in-memory collection. However, I think you can practically bring it back down to a single question if you let the database layer become your source of truth.

The reason I say this is that with roughly 100 items as the maximum likely number of active items in your list, you can pretty much ignore asymptotic complexity. Performance-wise, the most important thing to focus on when you've got this many items is round-trips across network connections (e.g. to the database).

Here's a fairly simple approach you can use. It's similar to something I've done in the past, with similar requirements. (I can't remember if it's exactly the same or not, but close enough.)

  1. Use a numeric Order column to determine the order of your items within the given list. int should be just fine.
  2. When you remove an item, decrement the orders of all items in the same list after that item. This can be done with a single UPDATE statement in SQL.
  3. When you add an item, give it an Order value based on the location it's added at, and increment the orders of all items in the same list after that item (again, with a single Update statement).
  4. When you move an item to a different location, change its Order and then increment or decrement the Orders of all the items between its starting and ending positions.
  5. Every time a change is made, re-load the entire list of items, in order, from the database to display to the user.

You may want to use stored procs to do more of this work in individual round-trips. Definitely use a transaction to avoid race conditions.
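
For illustration, here is a sketch of step 4 (the "move") done transactionally, assuming a hypothetical Items table with Id and SortOrder columns (the table, column, and method names are all made up); a stored proc could collapse it into a single round-trip:

    using System.Data;
    using System.Data.SqlClient;

    // Hypothetical schema: Items(Id int primary key, SortOrder int)
    static void MoveItem(SqlConnection conn, int itemId, int oldOrder, int newOrder)
    {
        using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.Serializable))
        {
            // Shift every item between the old and new positions by one,
            // in the direction opposite to the move.
            using (var shift = new SqlCommand(
                @"UPDATE Items
                  SET SortOrder = SortOrder + CASE WHEN @new < @old THEN 1 ELSE -1 END
                  WHERE SortOrder BETWEEN CASE WHEN @new < @old THEN @new ELSE @old + 1 END
                                      AND CASE WHEN @new < @old THEN @old - 1 ELSE @new END",
                conn, tx))
            {
                shift.Parameters.AddWithValue("@old", oldOrder);
                shift.Parameters.AddWithValue("@new", newOrder);
                shift.ExecuteNonQuery();
            }

            // Drop the moved item into its new slot.
            using (var place = new SqlCommand(
                "UPDATE Items SET SortOrder = @new WHERE Id = @id", conn, tx))
            {
                place.Parameters.AddWithValue("@new", newOrder);
                place.Parameters.AddWithValue("@id", itemId);
                place.ExecuteNonQuery();
            }

            tx.Commit();
        }
    }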

An approach like this will easily scale for individual users editing individual lists. If you need scalability in terms of concurrent users, it's likely that another strategy like a NoSQL store is going to be the way to go. If you need to scale on many concurrent users editing the same list, things get really complicated and you may need to implement message buses and other goodness. If you find that you need to scale to tens of thousands of items in the list, you'll need to rethink your UI and how it communicates with the server (e.g. you won't want to load the entire list into memory). But when each of the operations is performed manually by a user, worrying about your in-memory data structure isn't going to get you where you want to be in any of these cases.

StriplingWarrior
  • I agree with all of the first 4 points. The list does not need to be reloaded for display since the user will be using drag and drop to visually rearrange the items. The only time the list needs loading is in the beginning. I would like to have concurrency with many users but it isn't a requirement right now, the current database is MSSQL and I don't have time to go overboard but that'd be a nice improvement. Good point about scaling up. In your opinion, should I work directly with the DB without a data structure? – dR_ Apr 24 '17 at 22:49
  • I don't see reloads from the db being faster than operating on the List as in my answer. And then you can do the db call async. – paparazzo Apr 24 '17 at 23:23
  • @dR_: Yes, the solution I'm proposing here would just use the database structure to represent the state of the list, and you'd just reload the current state of the list. As Paparazzi points out, this would require a database round-trip with every user action. For a better user experience, you could represent your data with an in-memory structure locally, and save the list away to the database asynchronously, but syncing the db and memory model could be more complicated that way. My answer isn't necessarily the best, but it addresses both the database and in-memory needs you brought up. – StriplingWarrior Apr 25 '17 at 17:03

In terms of data structures, a linked list is fast for inserts and deletes, assuming you have a direct reference to the nodes (in this case, you would want a doubly-linked list). I haven't used the built-in .NET LinkedList, but it seems it has some efficiency problems. You may want to just use a normal List if you have issues with LinkedList (it really depends on how "efficient" you need this to be). See List time complexities here

As for saving it, all you need to do is save the index in your database and fill your collection from a query with an ORDER BY on start.

EDIT:

And for duplicate management, you could maintain a HashSet to check for duplicates and prevent insertion.
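
Putting the two together, here is a minimal sketch (the class and member names are invented) that hands LinkedListNode<T> references back to the UI, so inserting relative to a known node and removing a known node stay O(1):

    using System;
    using System.Collections.Generic;

    // Pair a doubly-linked list with a hash set: the list keeps the user's
    // order, the set gives O(1) duplicate checks.
    class UniqueLinkedList<T>
    {
        private readonly LinkedList<T> _list = new LinkedList<T>();
        private readonly HashSet<T> _seen = new HashSet<T>();

        public LinkedListNode<T> AddFirst(T item)
        {
            if (!_seen.Add(item)) throw new ArgumentException("Duplicate item.");
            return _list.AddFirst(item);
        }

        public LinkedListNode<T> InsertBefore(LinkedListNode<T> node, T item)
        {
            if (!_seen.Add(item)) throw new ArgumentException("Duplicate item.");
            return _list.AddBefore(node, item);
        }

        public void Remove(LinkedListNode<T> node)
        {
            _seen.Remove(node.Value);
            _list.Remove(node);
        }
    }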

MrZander
  • Inserting an item at an arbitrary index in a linked list is O(n). In addition to having fairly poor asymptotic complexity, the lack of memory locality means that they tend to perform worse than the alternatives in most any situation. – Servy Apr 24 '17 at 17:11
  • @Servy My answer is under the assumption that you have access to the node you are removing, in which case it is O(1) for a doubly-linked list. Given that OP is looking for "efficiency", he will obviously need to account for that and ensure he is accessing the nodes directly. And I noted the efficiency problems... the link directly compares a LL to a List. – MrZander Apr 24 '17 at 17:15
  • Yes, *if you already have a node* then removing it or inserting before/after it is reasonably fast. Of course, the OP is asking to insert a value to an arbitrary index, meaning *he won't have a node*. Finding the node at the relevant position will be O(n). – Servy Apr 24 '17 at 17:16
  • I'd argue it is most certainly _not_ arbitrary. His comment states he is inserting element X before element Y, meaning he has a reference to the element. – MrZander Apr 24 '17 at 17:18
  • No, that *doesn't* mean that he is going to have a reference to that node. He's just describing the results. The operation specifically requested is fast inserting to an arbitrary index. – Servy Apr 24 '17 at 17:20
  • OK, well OP will need to decide if it fits his use case or not. I can't imagine any sort of UI where the user is rearranging/inserting/removing items and getting direct reference to the node is not a trivial & constant time task. – MrZander Apr 24 '17 at 17:28
  • Finding the node isn't going to be *difficult*, it's just going to result in this data structure performing worse than just about any other option out there. – Servy Apr 24 '17 at 17:29
  • In the OP's latest version of the question, he says "Insert element 2 before element 1", which makes me side with MrZander on the LinkedList aspect. However, this answer ignores the "without duplicates" aspect of the question. – StriplingWarrior Apr 24 '17 at 17:32
  • @StriplingWarrior Thanks for pointing that out, I updated the answer. – MrZander Apr 24 '17 at 17:34