5

I am using a Dictionary and I need to store almost 13 000 000 keys in it. Unfortunately, after adding the 11 950 000th key I got a "System out of memory" exception. Is there any solution to this problem? In the future my program will also need to run on less powerful computers than the one I have now.

I need that many keys because I need to store pairs of sequence name and sequence length; it is for solving a bioinformatics-related problem.

Any help will be appreciated.

Perlnika
  • 4,796
  • 8
  • 36
  • 47
  • 13 million keys in RAM are really too much for a normal system! – Salvatore Previti Oct 24 '11 at 14:06
  • Must all of these reside in memory? – Oded Oct 24 '11 at 14:06
  • 1
    Use any one of them: http://nosql-database.org/ – L.B Oct 24 '11 at 14:07
  • Do you need all the keys in one Dictionary? Can't you break it up into multiple chunks? – Zenwalker Oct 24 '11 at 14:07
  • I would also recommend using a database to store such a big dataset – Emmanuel N Oct 24 '11 at 14:08
  • What sort of keys, what sort of values? What sort of processing? It seems OOTB solutions aren't going to do the job here, but just what should be used instead can't be said without more details. – Jon Hanna Oct 24 '11 at 15:07
  • You may want to roll your own bitmap vector trie. These are meant for incredible amounts of data, and are built in a way that allows paging of various parts in and out, and more importantly, in a way that allows non-sequential allocation. However, I should warn you: that is not an easy task. (Also, it's not guaranteed to actually work!) – configurator Oct 24 '11 at 16:41
  • 3
    Wait a minute, 13 million keys isn't even that much - what are the values here? Arrays? – harold Oct 24 '11 at 17:40

8 Answers

8

Buy more memory, install a 64-bit version of the OS and recompile for 64 bits. No, I'm not kidding. If you want that many objects... in RAM... then call it a "feature". If the new Android can require 16 GB of memory just to be compiled...

I was forgetting... You could begin by reading "C# array of objects, very large, looking for a better way".

Do you know how much 13 million objects is?

To make a comparison, a 32-bit Windows app has access to less than 2 GB of address space. That's 2 billion bytes (give or take), and 2 billion / 13 million comes to something around 150 bytes per object. Now, if we consider how much a reference type occupies... it's quite easy to eat up 150 bytes.

I'll add something: I've looked into my Magic 8-Ball and it told me: show us your code. If you don't tell us what you are using for the keys and the values, how are we supposed to help you? Are you using a class, a struct, or "primitive" types? Tell us the "size" of your TKey and TValue. Sadly our crystal ball broke yesterday :-)
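
If you want to measure instead of guess, here is a minimal sketch that samples a million entries and extrapolates. Dictionary<string, int> (sequence name to length) is only my assumption about your TKey/TValue, since the question doesn't show them:

    using System;
    using System.Collections.Generic;

    class MemoryEstimate
    {
        static void Main()
        {
            const int count = 1000000;             // sample 1M entries and extrapolate
            long before = GC.GetTotalMemory(true); // force a full collection first

            var map = new Dictionary<string, int>(count); // pre-sized, so resizes are not measured
            for (int i = 0; i < count; i++)
                map.Add("sequence_" + i.ToString("D9"), i); // ~18-character keys

            long after = GC.GetTotalMemory(true);
            Console.WriteLine("~{0} bytes per entry", (after - before) / count);
            GC.KeepAlive(map); // keep the dictionary reachable until after the measurement
        }
    }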

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • The `Dictionary` is pretty memory-efficient, though. So, it's not unrealistic to fit 13M items in a dictionary into 2 GB of memory. But you're right that it's quite easy to get above the size where it doesn't fit anymore. – svick Oct 24 '11 at 15:13
  • @svick From what I'm seeing of the implementation of Dictionary, for big numbers it searches a prime > 2 * Count, so > 26 million. Let's say it's 26 million; it then creates an array of int[] of 26M and an array of structs containing 2x int, 1x TKey, 1x TValue. So of pure overhead it's 3x int * 26M = 12 bytes * 26M ≈ 300 MB (I'm looking at Resize). If he is lucky and is able to fill it before the doubling, it's still 150 MB of overhead. – xanatos Oct 24 '11 at 15:25
  • Yeah, pretty much. Except that the prime is searched only when resizing, so it's actually a prime `> Count`. For 13M objects, the actual prime that it uses in my tests is 23,997,907. – svick Oct 24 '11 at 15:30
  • @svick The last prime "in table" in .NET 4.0 seems to be 7199369, so after that I would have thought he would have found something around 14 million – xanatos Oct 24 '11 at 15:38
  • 1
    Actually, the second-to-last number in the table (~6M) is used, and then comes manual computing (i.e. approx. doubling). So 24M fits. Of course this all assumes not setting the default capacity. If the number of the elements is known, setting the capacity might help quite a lot here. – svick Oct 24 '11 at 15:54
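
A minimal sketch of the capacity point from the comment above (the 13,000,000 figure is the element count from the question; the exact internal prime sizes vary by framework version):

    // Passing the expected element count up front lets the Dictionary pick a
    // large enough bucket count immediately, avoiding several resize-and-rehash
    // passes that each briefly keep the old and the new internal arrays alive
    // at the same time.
    var lengthsByName = new Dictionary<string, int>(13000000);
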
7

Well, I had almost exactly the same problem.

I wanted to load about 12.5 million [string, int]s into a dictionary from a database (for all the programming "gods" above who don't understand why, the answer is that it is enormously quicker when you are working with a 150 GB database if you can cache a proportion of one of the key tables in memory).

It annoyingly threw an out of memory exception at pretty much the same place - just under the 12 million mark - even though the process was only consuming about 1.3 GB of memory (reduced to about 800 MB after a judicious change in the DB read method so it didn't try to do it all at once), despite running on an i7 with 8 GB of memory.

The solution was actually remarkably simple - in Visual Studio (2010) in Solution Explorer right click the project and select properties. In the Build tab set Platform Target to x64 and rebuild.

It rattles through the load into the Dictionary in a few seconds and the Dictionary performance is very good.
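
If you want to double-check that the rebuilt program really runs as a 64-bit process, here is a tiny sanity check (Environment.Is64BitProcess needs .NET 4.0 or later):

    Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
    Console.WriteLine("Pointer size:   {0} bytes", IntPtr.Size); // 8 in a 64-bit process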

Brian Towers
  • 340
  • 12
  • 15
7

C# is not a language that was designed to solve heavy-duty scientific computation problems. It is absolutely possible to use C# to build tools that do what you want, but the off-the-shelf parts like Dictionary were designed to solve more common business problems, like mapping zip codes to cities and that sort of thing.

You're going to have to go with external storage of some sort. My recommendation would be to buy a database and use it to store your data. Then use a DataSet or some similar technology to load portions of the data into memory, manipulate them, and then pour more data from the database into the DataSet, and so on.
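
As a rough sketch of that pattern (the connection string, table, and column names below are invented for illustration; they are not from the question):

    using System.Data;
    using System.Data.SqlClient;

    // Hypothetical source: a table of (Name, Length) rows.
    using (var connection = new SqlConnection("Server=.;Database=Sequences;Integrated Security=true"))
    using (var adapter = new SqlDataAdapter("SELECT Name, Length FROM SequenceLengths ORDER BY Name", connection))
    {
        const int pageSize = 100000;
        for (int start = 0; ; start += pageSize)
        {
            var page = new DataSet();
            int rows = adapter.Fill(page, start, pageSize, "SequenceLengths"); // Fill opens/closes the connection itself
            if (rows == 0) break;

            foreach (DataRow row in page.Tables["SequenceLengths"].Rows)
            {
                string name = (string)row["Name"];
                int length = (int)row["Length"];
                // ... process this portion, then let it go out of scope ...
            }
        }
    }

Note that this overload of Fill still reads past the skipped rows on every call, so for very large tables a WHERE-based paging query is usually the better choice.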

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • 1
    You could simply create a DataSet (or even simple text file) and then persist the info to disk. You could then load only portions of the file into memory as needed. I doubt there is a need to load 13m keys and value to memory. – Jeff Reddy Oct 24 '11 at 15:15
  • 1
    Just use SQLite, much simpler IMHO – Tigran Oct 24 '11 at 17:23
0

The easy solution is to just use a simple DB. The most obvious choice in this case, IMHO, is SQLite .NET: fast, easy, and with a low memory footprint.
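
A minimal sketch of what that could look like with the System.Data.SQLite provider (the file, table, and column names are just placeholders):

    using System.Data.SQLite; // from the System.Data.SQLite package

    using (var connection = new SQLiteConnection("Data Source=sequences.db"))
    {
        connection.Open();

        using (var create = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS seq (name TEXT PRIMARY KEY, length INTEGER)", connection))
            create.ExecuteNonQuery();

        // Look up a single sequence length by name instead of holding 13M pairs in RAM.
        using (var lookup = new SQLiteCommand("SELECT length FROM seq WHERE name = @name", connection))
        {
            lookup.Parameters.AddWithValue("@name", "chr1");
            object length = lookup.ExecuteScalar(); // null if the name isn't present
        }
    }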

Tigran
  • 61,654
  • 8
  • 86
  • 123
  • As discussed in other comments/answers, using a DB isn't always an option. If you want to keep everything in memory for performance reasons, using a database will potentially kill your performance. – Joel B Aug 09 '16 at 15:44
0

Really, 13,000,000 items are quite a lot. If those 13,000,000 items are allocated classes, it is a very deep kick into the garbage collector's stomach!

Also, even if you find a way to use the default .NET dictionary, the performance would be really bad: that is a lot of keys, the number of keys approaches the number of values a 31-bit hash can use, performance will be awful on whatever system you use, and of course memory consumption will be too much!

If you need a data structure that can use more memory than a hash table, you probably need a custom hash table mixed with a custom binary tree data structure. Yes, it is possible to write your own combination of the two.

You certainly cannot rely on the .NET hashtable for such a strange and specific problem.

Consider that a tree has a lookup complexity of O(log n) but a build complexity of O(n log n); of course, building it will take too long. You could then build a hash table of binary trees (or vice versa), which would let you use both data structures while consuming less memory.

Then, think about compiling it in 32-bit mode, not in 64-bit mode: 64-bit mode uses more memory for pointers. At the same time, the opposite is possible: the 32-bit address space may not be sufficient for your problem. I have never had a problem that could exhaust the 32-bit address space!

If both keys and values are simple value types, I would suggest writing your data structure in a C DLL and using it through C#.

You can try to write a dictionary of dictionaries. Let's say you split your data into chunks of 500,000 items across 26 dictionaries, for example; but the occupied memory would be very, very big, and I don't think your system will handle it.

using System;
using System.Collections.Generic;

public class MySuperDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] dictionaries;

    public MySuperDictionary()
    {
        this.dictionaries = new Dictionary<TKey, TValue>[373]; // must be a prime number.
        for (int i = 0; i < dictionaries.Length; ++i)
            dictionaries[i] = new Dictionary<TKey, TValue>(13000000 / dictionaries.Length);
    }

    public void Add(TKey key, TValue value)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        dictionaries[bucket].Add(key, value);
    }

    public bool Remove(TKey key)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].Remove(key);
    }

    public bool TryGetValue(TKey key, out TValue result)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].TryGetValue(key, out result);
    }

    private static int GetSecondaryHashCode(TKey key)
    {
        // Return a hash code for the key, ideally computed with a different
        // hashing algorithm than the one the inner dictionaries use.
        throw new NotImplementedException();
    }
}
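
GetSecondaryHashCode is deliberately left to you. Purely as my own illustration for a string-keyed version (not part of the original sketch), an FNV-1a hash keeps it independent of the string.GetHashCode the inner dictionaries use:

    // A possible secondary hash for string keys (FNV-1a); any algorithm that
    // differs from the one used by the inner dictionaries will do.
    private static int GetSecondaryHashCode(string key)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in key)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }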
Salvatore Previti
  • 8,956
  • 31
  • 37
  • 13M items in the dictionary doesn't necessarily mean 13M objects that have to be GCed. That is, if both the keys and the values are value types (and are not boxed). And you're usually not limited by available RAM, but by available virtual memory. If you run out of physical RAM, the application may run terribly slow, but it will still run, not throw OOM exception. – svick Oct 24 '11 at 14:18
  • @svick If they're in a dictionary, then internally they are in structures on the heap. Besides, if they were value types on the stack then you'd run out of stack space even sooner. – Jon Hanna Oct 24 '11 at 15:04
  • Not really, Jon; the .NET dictionary, unlike the Java dictionary, uses only a constant number of reference objects (3, if I remember correctly). If key and value are value types, they are stored internally in an array of entry structures (a value type) and in an array of buckets (an array of integers). So the .NET dictionary is very well written and fast. The .NET garbage collector is smart enough not to look at pure value types for garbage collection. – Salvatore Previti Oct 24 '11 at 15:05
  • @JonHanna, yes, they are on the heap. But that doesn't mean GC has to track every one of them. GC tracks the whole dictionary (and the constant number of arrays it uses internally), but not every value in it (unless they are reference types, of course). – svick Oct 24 '11 at 15:09
0

I think that you need a new approach to your processing.

I must assume that you obtain the data from a file or a database; either way, that is where it should remain.

There is no way to actually increase the limit on the number of values stored within a Dictionary other than increasing system memory, but either way it is an extremely inefficient means of processing such a large amount of data.

You should rethink your algorithm so that you can process the data in more manageable portions. That will mean processing it in stages until you get your result. It may mean many hundreds of passes through the data, but it's the only way to do it.

I would also suggest that you look at using generics to help speed up this repetitive processing and cut down on memory usage.

Remember that there will still be a balancing act between system performance and access to externally stored data (be it external disk store or database).
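
As a rough illustration of processing in manageable portions (assuming, purely as a guess, that the sequences come from a large text file with one name/length record per line):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class ChunkedProcessing
    {
        static void Main()
        {
            // Hypothetical input: one "name<TAB>length" record per line.
            var chunk = new Dictionary<string, int>();
            foreach (string line in File.ReadLines("sequences.tsv")) // streams the file; never loads it whole
            {
                string[] parts = line.Split('\t');
                chunk[parts[0]] = int.Parse(parts[1]);

                if (chunk.Count == 500000)
                {
                    ProcessChunk(chunk);
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0)
                ProcessChunk(chunk);
        }

        static void ProcessChunk(Dictionary<string, int> chunk)
        {
            // One stage of the computation; only this portion is ever in memory.
        }
    }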

ChrisBD
  • 9,104
  • 3
  • 22
  • 35
0

The problem is not with the Dictionary object but with the available memory on your machine. I've done some investigation to understand the failures of the Dictionary object, but it never failed. Below is the code for your reference:

    private static void TestDictionaryLimit()
    {
        int intCnt = 0;
        Dictionary<long, string> dItems = new Dictionary<long, string>();
        Console.WriteLine("Total number of iterations = {0}", long.MaxValue);
        Console.WriteLine("....");
        for (long lngCnt = 0; lngCnt < long.MaxValue; lngCnt++)
        {
            if (lngCnt < 11950020) // a bit past the 11,950,000 mark where the question reports the failure
                dItems.Add(lngCnt, lngCnt.ToString());
            else
                break;
            if (lngCnt % 100000 == 0)
                Console.Write(intCnt++);
        }
        Console.WriteLine("Completed..");
        Console.WriteLine("{0} number of items in dictionary", dItems.Count);
    }

The above code executes properly and stores more items than the count you mentioned.

  • 1
    It has nothing to do with available *physical memory*. It is all about available *address space*. Are you running this on a 32 bit process, with 2GB of address space, or a 64 bit process with a billion GB of address space? – Eric Lippert Oct 24 '11 at 15:25
  • @EricLippert It's only 8TB of address space, even on higher-end Windows Server :-D Don't give new users false hopes :-D – xanatos Oct 25 '11 at 07:54
-1

With that many keys, you should either use a database or something like memcached, swapping chunks of the cache out to storage. I doubt you need all of the items in memory at once, and if you do, there's no way it's going to work on a low-powered machine with little RAM.

CassOnMars
  • 6,153
  • 2
  • 32
  • 47
  • Depending on the processing requirements, [MapReduce](http://en.wikipedia.org/wiki/MapReduce) may also be an option. – Oded Oct 24 '11 at 14:10