Whats the best collection to use for uniquely identifying nodes?

Question

Currently I am using a Dictionary<int,node> to store around 10,000 nodes. The key is used as an ID number for later look up and the 'node' is a class that contains some data. Other classes within the program use the ID number as a pointer to the node. (this may sound inefficient. However, explaining my reasoning for using a dictionary for this is beyond the scope of my question.)

However, 20% of the nodes are duplicate. What i want to do is when i add a node check to see if it all ready exists. if it does then use that ID number. If not create a new one.

This is my current solution to the problem:

public class nodeDictionary 
{

    Dictionary<int, node> dict = new Dictionary<int, node>( );
    public int addNewNode( latLng ll )
    {
        node n = new node( ll );
        if ( dict.ContainsValue( n ) )
        {
            foreach ( KeyValuePair<int, node> kv in dict )
            {
                if ( kv.Value == n )
                {
                    return kv.Key;
                }
            }
        }
        else
        {
            if ( dict.Count != 0 )
            {
                dict.Add( dict.Last( ).Key + 1, n );
                return dict.Last( ).Key + 1;
            }
            else
            {
                dict.Add( 0, n );
                return 0;
            }
        }
        throw new Exception( );
    }//end add new node
}

The problem with this is when trying to add a new node to a list of 100,000 nodes it takes 78 milliseconds to add the node. This is unacceptable because i could be adding an additional 1,000 nodes at any given time.

So, is there a better way do do this? I am not looking for someone to write the code for me, I am just looking for guidance.

score 3 · Accepted Answer · edited May 23 '17 at 11:55

It sounds like you want to

make sure that LatLng overrides Equals/GetHashCode (preferrably implement the IEquatable<LatLng> interface)
stuff all the items directly into a HashSet<LatLng>

For implementing GetHashCode, see here: Why is it important to override GetHashCode when Equals method is overridden?

If you need to generate 'artificial' unique IDs in some fashion, I suggest you use the dictionary approach again, but 'in reverse':

// uses the same hash function for speedy lookup/insertion
IDictionary<LatLng, int> idMap = new Dictionary<LatLng, int>(); 

foreach (LatLng latLng in LatLngCoords)
{
    if (!idMap.ContainsKey(latLng))
        idMap.Add(latLng, idMap.Count+1); // to start with 1
}

You can have the idMap replace the HashSet<>; the implementation (and performance characteristics) is essentially the same but as an associative container.

Here's a lookup function to get from LatLng to Id:

int IdLookup(LatLng latLng)
{
     int id;
     if (idMap.TryGetValue(latLng, id))
         return id;
     throw new InvalidArgumentException("Coordinate not in idMap");
}

You could just-in-time add it:

int IdFor(LatLng latLng)
{
     int id;
     if (idMap.TryGetValue(latLng, id))
         return id;

     id = idMap.Count+1;
     idMap.Add(latLng, id);
     return id;
}

so i did this and it the stuffing process is now lightning fast (about 80 ms for 100,000 nodes). But i still need an id number... — Chad Carisch, Nov 01 '11 at 18:50
Updated my answer with _a possible_ strategy to generate int ids. Note that the ids generated will depend on the order in which the coordinates are added to the dictionary. You cannot rely on this scheme to be 'stable' across program runs (unless the order and values of insertions is entirely deterministic all the time). — sehe, Nov 01 '11 at 19:05
While that is not a perfect strategy, I now understand how to tackle this problem. Thanks for your help! I will try to post my final code when I am done. — Chad Carisch, Nov 01 '11 at 19:12

Branko Dimitrijevic · Answer 2 · 2011-11-01T18:44:37.500

1

What exactly is the purpose of this code?

if ( dict.ContainsValue( n ) )
{
    foreach ( KeyValuePair kv in dict )
    {
        if ( kv.Value == n )
        {
            return kv.Key;
        }
    }
}

The ContainsValue searches for a value (instead of a key) and is very inefficient (O(n)). Ditto for foreach. Let alone you do both when only one is necessary (you could completely remove ContainsValue by rearranging your ifs a little)!

You should probably mainntain additional dictionary that is "reverse" of the original one (i.e. values in old dictionary are keys in the new one and vice versa), to "cover" your search patterns (similarly to how databases can maintain multiple indexes par table to cover multiple ways table can be queried).

edited Nov 01 '11 at 18:44

answered Nov 01 '11 at 18:19

Branko Dimitrijevic

50,809
10
93
167

In the code, the generic version is used (the non-generic version is called `HashTable`). Also, it's not clear to me whether you argue for or against `HashSet`). – svick Nov 01 '11 at 18:23
@svick I see you edited the code to use generics. Why? I commented on original code. BTW I'am (potentially) **for** `HashTable` (edited). – Branko Dimitrijevic Nov 01 '11 at 18:26
Sorry, I didn't notice that. The code was generic before my edit, but it didn't look that way, because the formatting used didn't show the parts in `<>`. The code wouldn't make sense otherwise, there is no non-generic `Dictionary` in .Net. – svick Nov 01 '11 at 18:31
Apparently I confused non-generic `IDictionary` which exists with non-generic `Dictionary` which doesn't. Mea culpa ;) – Branko Dimitrijevic Nov 01 '11 at 18:40

score 1 · Answer 3 · answered Nov 01 '11 at 18:19

I'd add a second dictionary for the reverse direction. i.e. Dictionary<Node,int>

Then you either

Are content with reference equality and do nothing.
Create an IEqualityComparer<Node> and supply it to the dictionary
Override Equals and GetHashCode on Node

In both cases a good implementation for the hashcode is essential to get good performance.

score 1 · Answer 4 · answered Nov 01 '11 at 18:20

Your solution is not only slow, but also wrong. The order of items in a Dictionary is undefined, so dict.Last() is not guaranteed to return the item that was added last. (Although it may often look that way.)

Using an id to identify an object in your application seems wrong too. You should consider using references to the object directly.

But if you want to use your current design and assuming that you compare nodes based on their latLng, you could create two dictionaries: the one you already have and a second one, Dictionary<latLng, int>, that can be used to efficiently fond out whether a certain node already exists. And if it does, it gives you its id.

Referencing to the object directly would not be possible in the big picture. But this is beyond the scope of the question. — Chad Carisch, Nov 01 '11 at 18:37

score 0 · Answer 5 · answered Nov 01 '11 at 18:00

0

You could try using HashSet<T>

answered Nov 01 '11 at 18:00

InBetween

32,319
3
50
90

score 0 · Answer 6 · answered Nov 01 '11 at 18:14

0

You might want to consider restructuring this to just use a List (where the 'key' is just the index into the List) instead of a Dictionary. A few advantages:

Looking up an element by integer key is now O(1) (and a very fast O(1) given that it's just an array dereference internally).
When you insert a new element, you perform an O(n) search to see whether it already exists in the list. If it does not, you've also already traversed the list and can have recorded whether you encountered an entry with a null record. If you have, that index is the new key. If not, the new key is the current list Count. You're enumerating the collection once instead of multiple times and the enumeration itself is much faster than enumerating a Dictionary.

answered Nov 01 '11 at 18:14

Dan Bryant

27,329
4
56
102

You could do the same in O(1) for both operations, if you used two hash tables (either two Dictionaries or a Dictionary and a HashSet). – svick Nov 01 '11 at 18:24
The problem with this is; if I remove a node, all of the IDs would have to shift. This would over complicate things. Although I could add a dummy node in the place of the node that got removed..... – Chad Carisch Nov 01 '11 at 18:33
@Chad, The intention would be to use null for empty nodes. If you have a value type, you can either use a nullable as your storage type or a sentinal (dummy) value to indicate empty. – Dan Bryant Nov 01 '11 at 18:45
@Chad, Upon further reflection, one downside to this approach is that it suffers from a stale reference problem due to the reuse of previously removed ids. Use of an old reference could incorrectly give a 'new' object value, rather than causing an error. – Dan Bryant Nov 01 '11 at 19:09

Whats the best collection to use for uniquely identifying nodes?

6 Answers6