
I would like to understand the cost of storing a large number of items in memory in C#. The data structure I need is a dictionary or something similar. Let's say the number of items I would like to hold is around 100 million, but the application will not reach that number immediately; it will take a long time until we are on the order of that limit.

I'm worried about operations whose cost is amortised but which I cannot afford to be too expensive at any given moment. Usually, when a dynamic data structure is full, it re-allocates itself; in the case of a dictionary, I think it will even re-index every item. So let's say the application maintains 20 million items and that just reaches the capacity of the dictionary. When new dictionary storage is allocated, those 20 million items need to be re-indexed.

This is why I was thinking that an array of dictionaries might be a good idea. Let's say I create 256 dictionaries. This immediately bounds the size of each internal dictionary to less than 1 million items, which should be manageable to build up dynamically, with all the re-indexing happening on the way up to that size. The cost seems to be just one extra index lookup per operation to find the correct dictionary.

Is this a reasonable approach? Is my analysis correct, or will the C# dictionary perform better than I think for some reason? Is there any other solution that would be better? I'm looking for a data structure with the same time complexity as the C# dictionary.

Edit: The dictionary key is a random value, so I can just take its first byte to find my index into the array of 256 dictionaries very cheaply.

I'm currently not considering a database, as I want all the items to be immediately available at very little cost. I need lookup in constant time with very little overhead. I can afford insert to be slower, but still constant time; the same goes for delete, it can be a little slower, but constant time is required.

It should be possible to fit all the items in memory. The items are small, about 50 bytes of data each, so the data structure must not have too much overhead per item.
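For illustration, here is a minimal sketch of the array-of-256-dictionaries approach I have in mind (the ulong key type and the class name are just placeholders for the sketch; the real key is simply a random value whose first byte picks the shard):

using System.Collections.Generic;

public class ShardedDictionary<TValue>
{
    // 256 shards; each stays well below 1 million items even at 100 million items in total.
    private readonly Dictionary<ulong, TValue>[] _shards = new Dictionary<ulong, TValue>[256];

    public ShardedDictionary()
    {
        for (int i = 0; i < _shards.Length; i++)
            _shards[i] = new Dictionary<ulong, TValue>();
    }

    // The low byte of the (randomly distributed) key picks the shard;
    // this is the one extra index lookup per operation.
    private Dictionary<ulong, TValue> ShardFor(ulong key) => _shards[(byte)key];

    public void Add(ulong key, TValue value) => ShardFor(key).Add(key, value);

    public bool TryGetValue(ulong key, out TValue value) => ShardFor(key).TryGetValue(key, out value);

    public bool Remove(ulong key) => ShardFor(key).Remove(key);
}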

Wapac
  • According to the documentation, "If Count is less than the capacity, this method approaches an O(1) operation. If the capacity must be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count." https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.dictionary-2.add?view=netframework-4.7.2 – ikkentim Aug 31 '18 at 11:48
  • How would you know which dictionary to look in? How do you know the number of items won't exceed 300 million? Since you're inherently trying to keep them all in memory through use of a dictionary you'll probably run into memory and swapping issues sooner or later. – GPW Aug 31 '18 at 11:48
  • It depends on what operations you need to do with the data in your Dictionary... Could you give an example of one operation you will need to do? – Djuro Aug 31 '18 at 11:52
  • Just a silly suggestion, but why don't you put it in a database and let the database handle this kind of issue for you? – Stefan Aug 31 '18 at 11:52
  • What types are used with dictionary? – Renatas M. Aug 31 '18 at 11:54
  • Suppose you created an array of 256 dictionaries. How would you know which key is present in which dictionary? Don't you think it's an overhead to search for a specific key in an array of 256 dictionaries? – Safwan Shaikh Aug 31 '18 at 11:57
  • @SafwanShaikh and he needs to know 2 keys to access; if you need to know 2 keys, or keys are not important, I assume you don't need a Dictionary at all... – Djuro Aug 31 '18 at 11:58
  • @Djuro Yes. The word "Dictionary" comes into play when you have something "unique" to get a specific instance. – Safwan Shaikh Aug 31 '18 at 12:00
  • If you need speed, HashSet is fast (but it still depends on what operations you need to do) – Djuro Aug 31 '18 at 12:03
  • If you know your memory requirements (i.e. the # of items you expect in the dictionary), you can provide a capacity "up-front" with `new Dictionary(100000000)` and avoid the rebucketing issues you describe. – spender Aug 31 '18 at 12:04
  • ...however, with so many items, you might choose to investigate `gcAllowVeryLargeObjects` – spender Aug 31 '18 at 12:12
  • @SafwanShaikh I have a unique key with random distribution, so I can just use the first byte of it to find the correct dictionary at very low cost – Wapac Aug 31 '18 at 12:19
  • @Stefan too slow to use a database; I need everything immediately in memory, within just a couple of instructions – Wapac Aug 31 '18 at 12:20
  • @Wapac: in general, the rule is simple: if you need top speed at run time, make sure everything fits (and is sorted) at the beginning (slow initialization time); if you can't fit it in memory, put it on disk (slow run-time access); everything else is a hybrid compromise. So the main question here is: is it going to fit? – Stefan Aug 31 '18 at 12:26
  • @Stefan It should be possible to fit all the items in the memory. So the data structure must not have too much of an overhead for each item. – Wapac Aug 31 '18 at 12:28
  • Why not use the Dictionary constructor overload that lets you specify an initial capacity? If you specified a big enough capacity, then the underlying array would never need to be re-allocated, and all calls to `.Add()` would be `O(1)`. This could, of course, result in a lot of wasted memory... – Matthew Watson Aug 31 '18 at 12:35
  • @Wapac - combine the suggestions to use `gcAllowVeryLargeObjects` (and target x64 only) and pre-initialization to expected max size (or next higher square number?) and *use only a single* dictionary. -- if you worry about the "overhead" of *one* dictionary, think about the overhead of 256 dictionaries. – Corak Aug 31 '18 at 12:37
  • @MatthewWatson `This could, of course, result in a lot of wasted memory.` Exactly because of that. As mentioned in the question, I expect it to take a long time to reach the maximum, so preallocating everything from the start feels very wasteful. – Wapac Aug 31 '18 at 12:40
  • @Corak I can't see why 256 dictionaries would present greater overhead per item than just one dictionary. The dictionary itself certainly has some overhead, but it should be negligible. Would `gcAllowVeryLargeObjects` somehow prevent wasting memory on unused slots before the dictionary gets anywhere close to the limit? – Wapac Aug 31 '18 at 12:43
  • @Wapac - you also mentioned that speed to add/access is very critical. So this seems like a classic time-space-trade-off -- why would there be "higher" overhead per item if you only use *one* dictionary? – Corak Aug 31 '18 at 12:43
  • I actually think an "array of Dictionaries" would work fine. – Matthew Watson Aug 31 '18 at 12:45

3 Answers


UPDATE: I've edited this since I posted it:

  • Store a fixed-size object (a byte[50]) with each pass,
  • pre-allocate all these before adding to the dictionary (rather than creating objects within the loop)
  • run GC.Collect() after pre-allocating stuff.
  • gcAllowVeryLargeObjects set to true.
  • definitely set for x64 (it was before, but then I switched to 'Release' to build and run outside VS... and it reset, so oops.)
  • tried it with and without pre-allocating the dictionary size.

Here is the code now:

var arrays = new byte[100000000][];
System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
for (var i = 0; i < 100000000; i++)
{
    arrays[i] = new byte[50];   // pre-allocate all the 50-byte payloads before touching the dictionary
}
stopwatch.Stop();
Console.WriteLine($"initially allocating arrays took {stopwatch.ElapsedMilliseconds} ms");
stopwatch.Restart();

GC.Collect();
Console.WriteLine($"GC after array allocation took {stopwatch.ElapsedMilliseconds} ms");

// Swap these two lines to test with and without pre-allocating the dictionary's capacity.
Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>(100000000);
//Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>();

for (var c = 0; c < 100; c++)
{
    stopwatch.Restart();
    for (var i = 0; i < 1000000; i++)
    {
        dict.Add((c * 1000000) + i, arrays[(c * 1000000) + i]);
    }
    stopwatch.Stop();
    Console.WriteLine($"pass number {c} took {stopwatch.ElapsedMilliseconds} milliseconds");
}

Console.ReadLine();

Here is the output when I DON'T pre-allocate the size of the dictionary:

initially allocating arrays took 14609 ms
GC after array allocation took 3713 ms
pass number 0 took 63 milliseconds
pass number 1 took 51 milliseconds
pass number 2 took 78 milliseconds
pass number 3 took 28 milliseconds
pass number 4 took 32 milliseconds
pass number 5 took 133 milliseconds
pass number 6 took 41 milliseconds
pass number 7 took 31 milliseconds
pass number 8 took 27 milliseconds
pass number 9 took 26 milliseconds
pass number 10 took 45 milliseconds
pass number 11 took 335 milliseconds
pass number 12 took 34 milliseconds
pass number 13 took 35 milliseconds
pass number 14 took 71 milliseconds
pass number 15 took 66 milliseconds
pass number 16 took 64 milliseconds
pass number 17 took 58 milliseconds
pass number 18 took 71 milliseconds
pass number 19 took 65 milliseconds
pass number 20 took 68 milliseconds
pass number 21 took 67 milliseconds
pass number 22 took 83 milliseconds
pass number 23 took 11986 milliseconds
pass number 24 took 7948 milliseconds
pass number 25 took 38 milliseconds
pass number 26 took 36 milliseconds
pass number 27 took 27 milliseconds
pass number 28 took 31 milliseconds
..SNIP lots between 30-40ms...
pass number 44 took 34 milliseconds
pass number 45 took 34 milliseconds
pass number 46 took 33 milliseconds
pass number 47 took 2630 milliseconds
pass number 48 took 12255 milliseconds
pass number 49 took 33 milliseconds
...SNIP a load of lines which are all between 30 to 50ms...
pass number 93 took 39 milliseconds
pass number 94 took 43 milliseconds
pass number 95 took 7056 milliseconds
pass number 96 took 33323 milliseconds
pass number 97 took 228 milliseconds
pass number 98 took 70 milliseconds
pass number 99 took 84 milliseconds

You can clearly see the points where it has to re-allocate, presumably by roughly doubling the size of the backing storage and rehashing the existing items, since there's a long stretch at the end where it doesn't happen. Some of these re-allocations are very expensive though (30+ seconds! ouch).
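If you want to pinpoint the resize points rather than infer them from the timings, something like the following sketch works on runtimes where Dictionary<TKey,TValue>.EnsureCapacity is available (.NET Core 2.1 and later; it isn't in .NET Framework 4.7.2); calling it with 0 simply reports the current capacity without forcing a resize:

var dict = new Dictionary<int, byte[]>();
int lastCapacity = dict.EnsureCapacity(0);
for (int i = 0; i < 10000000; i++)
{
    dict.Add(i, new byte[50]);
    int capacity = dict.EnsureCapacity(0);   // reports the current capacity, no resize triggered
    if (capacity != lastCapacity)
    {
        Console.WriteLine($"resized after {i + 1} items: capacity is now {capacity}");
        lastCapacity = capacity;
    }
}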

and here is the output if I do pre-allocate the dictionary size:

initially allocating arrays took 15494 ms
GC after array allocation took 2622 ms
pass number 0 took 9585 milliseconds
pass number 1 took 107 milliseconds
pass number 2 took 91 milliseconds
pass number 3 took 145 milliseconds
pass number 4 took 83 milliseconds
pass number 5 took 118 milliseconds
pass number 6 took 133 milliseconds
pass number 7 took 126 milliseconds
pass number 8 took 65 milliseconds
pass number 9 took 52 milliseconds
pass number 10 took 42 milliseconds
pass number 11 took 34 milliseconds
pass number 12 took 45 milliseconds
pass number 13 took 48 milliseconds
pass number 14 took 46 milliseconds
..SNIP lots between 30-80ms...
pass number 45 took 80 milliseconds
pass number 46 took 65 milliseconds
pass number 47 took 64 milliseconds
pass number 48 took 65 milliseconds
pass number 49 took 122 milliseconds
pass number 50 took 103 milliseconds
pass number 51 took 45 milliseconds
pass number 52 took 77 milliseconds
pass number 53 took 64 milliseconds
pass number 54 took 96 milliseconds
..SNIP lots between 30-80ms...
pass number 77 took 44 milliseconds
pass number 78 took 85 milliseconds
pass number 79 took 142 milliseconds
pass number 80 took 138 milliseconds
pass number 81 took 47 milliseconds
pass number 82 took 44 milliseconds
..SNIP lots between 30-80ms...
pass number 93 took 52 milliseconds
pass number 94 took 50 milliseconds
pass number 95 took 63 milliseconds
pass number 96 took 111 milliseconds
pass number 97 took 175 milliseconds
pass number 98 took 96 milliseconds
pass number 99 took 67 milliseconds

Memory usage goes up to just over 9GB whilst initially creating the arrays, drops to about 6.5GB after the GC.Collect, climbs back up to over 9GB whilst adding to the dictionary, and then, after all is done (and it's sat waiting on Console.ReadLine()), it drops back down to ~3.7GB and stays there.

It's clearly far faster in operation to pre-allocate the dictionary though.

For reference, the original answer is below.

I just wrote this little test. I have no idea WHAT you're storing, so I've just created a little pointless class with not much info and used an int as the key, but my takeaways from this are:

  1. It doesn't appear to get progressively slower at adding to the dictionary until it gets to around 40 million items. Running a 'Release' build targeted at x64, it took around 500ms for every million inserts, and then passes 41 through 46 took more like 700-850ms (a noticeable jump at that point).

  2. It got to somewhere over 46,000,000 items, ate about 4GB of RAM, and fell over with an OutOfMemoryException.

  3. Use a database, or the Dictionary abuse team will come and take you down.

code:

using System;
using System.Collections.Generic;

class AThing
{
    public string Name { get; set; }
    public int id { get; set; }
}
class Program
{
    static void Main(string[] args)
    {
        Dictionary<int, AThing> dict = new Dictionary<int, AThing>();

        for (var c = 0; c < 100; c++)
        {
            DateTime nowTime = DateTime.Now;
            for (var i = 0; i < 1000000; i++)
            {
                var thing = new AThing { id = (c * 1000000) + i, Name = $"Item {(c * 1000000) + i}" };
                dict.Add(thing.id, thing);
            }
            var timeTaken = DateTime.Now - nowTime;
            // TotalMilliseconds, not Milliseconds: the latter only reports the 0-999 ms component.
            Console.WriteLine($"pass number {c} took {timeTaken.TotalMilliseconds} milliseconds");
        }
    }
}
GPW
  • I am still wondering what OP is storing; it's a big difference if it's a single byte or a 4K image. – Stefan Aug 31 '18 at 12:23
  • Just a small hint: use `System.Diagnostics.Stopwatch` for measuring execution time. https://stackoverflow.com/a/28648/4634044 – ChristianMurschall Aug 31 '18 at 12:30
  • I have updated the question to include the size of the item. It is quite interesting that you have 4 GB of memory consumed with just 46m items; this suggests there is some kind of big overhead somewhere. – Wapac Aug 31 '18 at 12:32
  • @Wapac - well, you have *at least* the size of a reference to `AThing` (basically an address in memory) just to get hold of the object, then four bytes for the `id` property and two bytes for every character of the `Name`, plus the `'\0'` at the end, and probably a bit (haha) more, since properties are methods with their own "pointer" size and so on **for each item**. -- and **that** would be the same cost whether you put everything in *one* dictionary or somehow separate it over a few hundred dictionaries. – Corak Aug 31 '18 at 12:54
  • @linuxrocks - thanks, I'll try to remember that one :) – GPW Aug 31 '18 at 13:12
  • @GPW well, for such long runs as in your example we don't need high resolution timers anyway. – ChristianMurschall Aug 31 '18 at 13:20
  • @Wapac - honestly, I was surprised too... I was just wondering whether it would have noticeable jumps where it had to allocate more memory or anything, but wasn't expecting it to run out so quickly. Obviously for the sake of this tiny example I didn't think it mattered too much... but doing the maths, 46,000,000 instances of 50 bytes is over 2GB anyway, so at 100,000,000 you'll be happily over 4GB. I'm not suggesting it's impossible to hold all this data in memory for quick access... I just think that a C# Dictionary is probably not the thing to use. – GPW Aug 31 '18 at 13:23
  • @Wapac I've updated the sample. It consumes *considerably* more memory to create objects in x64 mode, I guess because of 64-bit pointers? It doesn't run out of memory now, and it doesn't appear to lose performance directly due to the number of items in the dictionary. No idea how lookups may be affected though. – GPW Aug 31 '18 at 14:26
  • Dictionary was never designed for this purpose. Firstly, you probably want a 64-bit hash state at a minimum, assuming you have control over the hash computation. You should use a good hash function, not the ones built in, and you should cache the hash of every entry. Secondly, Dictionary wastes a lot of time and space with prime numbers. Create your own hierarchical hash table, using power-of-two table sizes, and you can probably do much better. – Frank Hileman Apr 19 '19 at 21:31

I know three years have passed since you initially asked this question, but I ran into the same problem myself and managed to solve it by implementing a FixedSizeDictionary<TKey, TValue> where I pass the maximum size as an int. While items keep being added, it also removes the oldest one once the item count exceeds the fixed value provided.
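A minimal sketch of that idea (the class name matches what I described, but this particular implementation, pairing a pre-sized Dictionary with a queue of keys in insertion order, is just an illustration rather than the exact code):

using System.Collections.Generic;

public class FixedSizeDictionary<TKey, TValue>
{
    private readonly int _maxSize;
    private readonly Dictionary<TKey, TValue> _items;
    private readonly Queue<TKey> _insertionOrder;

    public FixedSizeDictionary(int maxSize)
    {
        _maxSize = maxSize;
        _items = new Dictionary<TKey, TValue>(maxSize);   // pre-sized, so it never has to rehash
        _insertionOrder = new Queue<TKey>(maxSize);
    }

    public void Add(TKey key, TValue value)
    {
        // Once the fixed size is reached, evict the oldest entry before adding the new one.
        if (_items.Count >= _maxSize)
        {
            var oldest = _insertionOrder.Dequeue();
            _items.Remove(oldest);
        }
        _items.Add(key, value);
        _insertionOrder.Enqueue(key);
    }

    public bool TryGetValue(TKey key, out TValue value) => _items.TryGetValue(key, out value);

    public int Count => _items.Count;
}

Note that this sketch does not support removing arbitrary keys; if you need that, the eviction logic has to skip keys that are no longer present in the dictionary.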

Damian

If the program is expected to work while the dictionary is at maximum size, then why don't you just allocate it at maximum size from the beginning and avoid the re-indexing and so on completely? The amount of memory used only differs from the other solution temporarily, but the time saved is not temporary. Plus, while the dictionary is in its emptier state, there is a very low chance of collisions.
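A minimal sketch of that suggestion, using the 100 million figure from the question (on .NET Framework a dictionary this large would also need gcAllowVeryLargeObjects, since its internal arrays exceed 2 GB, as mentioned in the comments above):

using System.Collections.Generic;

// Allocate the full expected capacity once, up front, so the dictionary never resizes or rehashes.
var items = new Dictionary<ulong, byte[]>(100000000);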