
I have a relatively large set of data that lends itself very naturally to C#'s Dictionary object. Currently, I have 102400 key-value pairs that are generated semi-dynamically when my program starts up. My problem is that I have to run a great number of lookup operations against them as fast as possible.

According to This Page, the speed of the lookups is directly influenced by the number of key-value pairs in the dictionary. My data is a bit odd in that a great number of different keys lead to the same value. In fact, I have only 4900 distinct values... This means that I have an average of roughly 20 key-value pairs for each distinct value.

My first instinct was to swap the keys and the values (as I only care about the distinct values in the data) and to keep the old keys in a list or array as the new values. This reduced my dictionary from 102400 to 4900 key-value pairs, but I can't see any way to efficiently search all of the lists for a specific entry in order to get back the key it belongs to.

I know that my description probably got a bit difficult to follow since I switched the keys and values, so I've included a mock-up of my data to show you what I mean:

Old structure:

Key   Value
---   -----
1     1
2     2
3     3
4     1
5     3
6     2
7     2
8     1
9     3
10    2
11    3
12    1

New structure:

Key   Value
---   -----
1     {1,4,8,12}
2     {2,6,7,10}
3     {3,5,9,11}

In my program, I'm going to be given '11' and I'll need to return '3'. The first structure is a simple lookup, but it is a huge list, which seems to be slowing things down... The second adds so much logical overhead to track down which value list I'm looking for that I've only seen a reduction in speed when trying to implement it.
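
To make that overhead concrete, here is a rough sketch of what the reverse lookup ends up doing under the new structure, using the int keys and values of the mock-up above (my real types are different):

    // Build the "new structure" by inverting the original dictionary:
    // each distinct value maps to the list of old keys that share it.
    var original = new Dictionary<int, int>
        { { 1, 1 }, { 2, 2 }, { 3, 3 }, { 4, 1 }, { 5, 3 }, { 6, 2 },
          { 7, 2 }, { 8, 1 }, { 9, 3 }, { 10, 2 }, { 11, 3 }, { 12, 1 } };

    var inverted = new Dictionary<int, List<int>>();
    foreach (var pair in original)
    {
        List<int> keys;
        if (!inverted.TryGetValue(pair.Value, out keys))
        {
            keys = new List<int>();
            inverted[pair.Value] = keys;
        }
        keys.Add(pair.Key);
    }

    // Given '11', every list potentially has to be scanned to find the
    // entry; that scan is the logical overhead I mentioned.
    foreach (var pair in inverted)
    {
        if (pair.Value.Contains(11))
        {
            Console.WriteLine(pair.Key);   // prints 3
            break;
        }
    }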

Am I barking up the wrong tree here? Should I just accept the speed of the larger list, or is there some other way that I can store my data to increase the lookup speed?

Chronicide
  • Dictionaries should be O(1) in lookup, which means lookup time should remain relatively constant even when the dictionary grows. However, the issue you might be having is the key not having a well-distributed hash. Is your dictionary a `Dictionary` or is the key a different (custom?) type? – Anthony Pegram Jan 03 '12 at 18:49
  • My dictionary is a Dictionary. I've received a lot of input from the various answers, and I've been trying to implement some speed tests to see if the various suggestions have a noticeable impact. I'll add a few tests of my own to see if I do see any increase in speed for smaller dictionaries of primitive types. – Chronicide Jan 03 '12 at 19:36

5 Answers


If all keys are distinct and contiguous, then you should consider a simple array; if the keys aren't contiguous, then use a hash map type of structure. This would approach O(1) if the hashing function is good, and if the keys are all integers, it shouldn't take up much space.

Even then, for 102400 elements, a binary tree lookup would take about log2(102400) ≈ 16.64 comparisons per lookup, which is not exactly slow.
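
As a rough sketch of the array idea, assuming the keys really do run over a contiguous, known range and the values are ulong (both assumptions here, not taken from your code):

    // One slot per key; the key itself becomes the array index.
    const int entries = 102400;
    ulong[] table = new ulong[entries + 1];   // +1 so keys can start at 1 instead of 0

    // Fill once at startup (placeholder values for illustration).
    for (int key = 1; key <= entries; key++)
        table[key] = (ulong)(key % 4900);

    // A lookup is then a single index operation (no hashing, no probing).
    ulong value = table[11];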

PlexQ
  • I've done some speed tests, and using a simple ulong[102400] is up to 10x faster than a dictionary of the same size. I'll be trying a few other things, but I'll mark yours as the answer as it was the first to mention using a simple array (I feel a bit foolish for not thinking of using a simple array myself..) – Chronicide Jan 03 '12 at 19:48

Use a Lookup (.NET 3.5 and above).

From MSDN:

Lookup(Of TKey, TElement)

Represents a collection of keys each mapped to one or more values.

EDIT: By the way, if all your keys are contiguous (i.e. 1, 2, 3, ...), use a simple array.
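
A Lookup has no public constructor, so it has to come out of a LINQ operator such as ToLookup. A minimal sketch, with key/value types assumed from the rest of this thread:

    // Requires System, System.Collections.Generic and System.Linq.
    // Group the original keys by the value they share.
    var original = new Dictionary<int, ulong> { { 1, 1 }, { 4, 1 }, { 8, 1 }, { 2, 2 } };
    ILookup<ulong, int> byValue = original.ToLookup(p => p.Value, p => p.Key);

    // Every original key that maps to the value 1:
    foreach (int key in byValue[1])
        Console.WriteLine(key);   // prints 1, 4 and 8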

user703016
  • This is a Linq class and is described as immutable and with no public constructor. I'm not a c# expert, but how would you build this if you weren't using a Linq query? – PlexQ Jan 03 '12 at 19:02
  • @PlexQ You need a Linq query to build it. – user703016 Jan 03 '12 at 19:12

Using your parameters of a Dictionary<int, ulong>, 20 key/value pairs per unique value, a size of 102,400 total key/value pairs, and the code you linked to, I ran a test on a 102,400 count dictionary and one ten times that size:

    // Requires System, System.Collections.Generic and System.Diagnostics.
    int entries = 102400;
    var full = new Dictionary<int, ulong>();   // ten times the question's size (1,024,000 entries)
    var half = new Dictionary<int, ulong>();   // the question's size (102,400 entries)
    var both = new Dictionary<int, ulong>();   // the keys that get looked up in both tests

    for (int i = 0; i < entries * 10; i++)
    {
        full.Add(i, (ulong)(i % 20));
        if (i < entries)
        {
            both.Add(i, (ulong)(i % 20));
            half.Add(i, (ulong)(i % 20));
        }
    }

    // Time m passes of looking up every key of 'both' in the large dictionary.
    const int m = 100;
    Stopwatch s1 = Stopwatch.StartNew();
    for (int i = 0; i < m; i++)
    {
        foreach (var key in both.Keys)
        {
            if (!full.ContainsKey(key))
            {
                throw new Exception();
            }
        }
    }
    s1.Stop();

    // Repeat the same lookups against the smaller dictionary.
    Stopwatch s2 = Stopwatch.StartNew();
    for (int i = 0; i < m; i++)
    {
        foreach (var key in both.Keys)
        {
            if (!half.ContainsKey(key))
            {
                throw new Exception();
            }
        }
    }
    s2.Stop();
    Console.WriteLine("{0},{1}, difference = {2}", s1.ElapsedMilliseconds, s2.ElapsedMilliseconds, s1.ElapsedMilliseconds - s2.ElapsedMilliseconds);

Both tests finished within 10 milliseconds of each other.

I don't think your lookup speed is the issue here.

GalacticJello

A dictionary is the way to go if your keys are non-contiguous; I'm not aware of any faster lookup method for that kind of data. Your example shows contiguous, sequential keys, which could benefit from storing the values directly in an array and jumping straight to the correct index based on the key. As long as the keys in your real data mimic your example keys, I would go with an array.

Bradley Uffner

Once you have made your new structure, it is something like this, as far as I understood:

Dictionary<first, List<second>>, where first and second are integers. You can make sure that the content of each List<second> is kept ordered.

Considering that your challenge is not fast composition of the data, but fast access and retrieval, keeping the List<second> sorted will let you use List.BinarySearch, the fastest possible way to find a value among the list items.
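
For illustration, a minimal sketch of this idea with the mock-up data from the question (the variable names here are just for the sketch):

    // Each distinct value maps to a *sorted* list of the old keys.
    var inverted = new Dictionary<int, List<int>>
    {
        { 1, new List<int> { 1, 4, 8, 12 } },
        { 2, new List<int> { 2, 6, 7, 10 } },
        { 3, new List<int> { 3, 5, 9, 11 } },
    };

    // BinarySearch requires a sorted list and returns a non-negative
    // index when the item is found.
    int wanted = 11;
    foreach (var pair in inverted)
    {
        if (pair.Value.BinarySearch(wanted) >= 0)
        {
            Console.WriteLine(pair.Key);   // prints 3
            break;
        }
    }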

Tigran
  • I'm sorry, but a binary search, which is O(log n) for a dataset with high simple key cardinality (and given they're unique it is), is definitely not the fastest way to find data. Hashing is almost always faster. – PlexQ Jan 03 '12 at 18:53
  • @PlexQ: What do you mean by "hashing", considering that we are talking about a `List` here? – Tigran Jan 03 '12 at 19:14