This is closely related to .NET collections and the Large Object Heap (LOH). In a nutshell, once there are more than 85K buckets, the collection automatically ends up on the LOH, and when that memory is released is unknown. Is anyone aware of a good implementation of IDictionary based on lists of arrays, or something similar, that prevents it from going to the LOH?
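(For context, and not part of the original question: a minimal sketch of the threshold being discussed. Allocations of roughly 85,000 bytes or more go straight to the LOH, and the GC reports such objects as generation 2 even immediately after allocation.)
using System;

class LohThresholdSketch
{
    static void Main()
    {
        byte[] small = new byte[80000]; // well under 85,000 bytes: allocated in gen 0
        byte[] large = new byte[85000]; // at the threshold: allocated on the LOH

        Console.WriteLine(GC.GetGeneration(small)); // typically prints 0
        Console.WriteLine(GC.GetGeneration(large)); // prints 2 (LOH objects are treated as gen 2)
    }
}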
-
The fact that large collections are on the LOH is a good thing, not something you should try to avoid. What do you mean “when it's released is unknown”? Objects on the LOH are treated the same as other objects in gen 2 by the GC. – svick Oct 12 '12 at 17:57
-
@svick but the LOH is not compacted. – phoog Oct 12 '12 at 17:58
-
@phoog That's exactly why it is a good thing: moving the whole collection, possibly on every full GC, could affect performance. If you're worried about fragmentation, you should have good data showing that it's actually a problem for you. – svick Oct 12 '12 at 18:01
-
@svick we encountered a problem recently in some app that would create huge arrays, and the machine would eventually start swapping because the LOH was not freed up frequently enough. – Schultz9999 Oct 12 '12 at 18:01
-
@Schultz9999 Then maybe you should use something like [object pool](http://en.wikipedia.org/wiki/Object_pool_pattern) instead. I'm not sure allocating lots of small objects instead is such a good idea. – svick Oct 12 '12 at 18:03
-
@svick I didn't have good data, just a hunch (mea culpa) but switching dictionary types made my program go from 12 hours to 3 or 4 with a huge data set. Smaller sets take anywhere from 1 minute to an hour, and these times increased by a few percent only. – phoog Oct 12 '12 at 18:04
-
@Schultz9999 note that the cutoff for the LOH is measured in *bytes*; each bucket takes 4 or 8 bytes, depending on the operating system. – phoog Oct 12 '12 at 18:08
-
@phoog That's certainly interesting. It would be interesting to see a detailed analysis of why it helped in your case. (Though I understand you're not going to do that.) – svick Oct 12 '12 at 18:10
-
@svick true, but I can give you the non-detailed analysis. The program is extremely inefficient with object allocations, especially strings and large lists. Also, the record types wrap every value in an object, so a record with 10 numeric fields requires 11 objects instead of 1; with 10 string fields you need 21 objects instead of 11... – phoog Oct 12 '12 at 18:16
-
@svick ... As the dictionaries fill up, the GC runs more frequently. The object graph also gets larger, so GC takes longer. (I didn't design this, I inherited it.) To fix this, we tried freeing memory by deleting values we no longer needed from the dictionaries and moving the remaining data to new smaller dictionaries. This didn't help because of LOH fragmentation, but with the SortedDictionary it helped quite well. – phoog Oct 12 '12 at 18:17
-
@svick: The fact that big things end up on the LOH is only a good thing if the things in question are long-lived. There are many scenarios where one may create and abandon many "large" objects (and 85K isn't really all that big). The requirement that such objects must live until the next Gen2 collection doesn't seem like a "good" thing to me. And the special rule about arrays of `double` is truly mind-boggling. Why not align all Gen0 objects on 16-byte boundaries, and arrange all Gen1 and Gen2 arrays so that arrays of 8-byte items are 8-byte aligned, and 16-byte items are 16-byte aligned? – supercat Jan 07 '13 at 21:04
-
@svick: Be aware that if one is stuck in a 32-bit process, having objects on LOH is very bad. Both of the large-scale .Net projects I was involved in for years expended substantial effort minimizing LOH usage, and adding max-generation GC calls, until those projects were finally rid of legacy 32-bit dependencies (or moved those dependencies to a separate process). REASON: Fragmentation of memory. Would get OutOfMemory, with half a Gigabyte still free, but unable to be used for a large dictionary or array. It was the single most frustrating aspect of early .Net usage. – ToolmakerSteve May 01 '14 at 03:28
2 Answers
You can use the SortedDictionary, which is implemented as a binary search tree.
If you need the Dictionary's O(1) performance, or something closer to that, you could use a different hash table implementation that stores the data in chunks small enough not to go on the LOH. I'm not aware of anything publicly available; I have used SortedDictionary in the past and found that the decrease in performance was minimal, so I didn't look any further.
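To make that concrete, here is a minimal sketch (my illustration, not code from the answer) of the swap: SortedDictionary allocates one small node per entry instead of one large backing array, so no single allocation approaches the 85,000-byte LOH threshold; the trade-off is O(log n) lookups and a key type that can be ordered.
using System;
using System.Collections.Generic;

class SortedDictionarySketch
{
    static void Main()
    {
        // Each entry lives in its own small tree node, so the collection never
        // produces a single large allocation, no matter how many entries it holds.
        var map = new SortedDictionary<int, string>();

        for (int i = 0; i < 1000000; i++)
        {
            map[i] = "value " + i;
        }

        Console.WriteLine(map[123456]); // prints "value 123456"
    }
}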

phoog
-
One caveat: `SortedDictionary` requires ordering, normal `Dictionary` doesn't. – svick Oct 12 '12 at 18:08
-
@svick well, that's the essence of the question, actually. Using SortedDictionary is an off-the-shelf solution. It may not be the best, but it's something we can use right away. Ideally, I think it can all be implemented the way Servy presented it -- through partitioning. – Schultz9999 Oct 12 '12 at 18:28
Here's a start of one option. I assume you can follow the pattern given to implement the other methods. Just change numDictionaries to determine how the data is broken up. If you really need to, you could make the number of dictionaries dynamic and have it add more when the existing ones get sufficiently large.
using System.Collections.Generic;
using System.Linq;

public class NonContigousDictionary<TKey, TValue>
    //TODO make this implement IEnumerable, IDictionary,
    //and any other relevant interfaces.
{
    public Dictionary<TKey, TValue>[] dictionaries;
    private readonly int numDictionaries = 5;

    public NonContigousDictionary()
    {
        dictionaries = Enumerable.Range(0, numDictionaries)
            .Select(_ => new Dictionary<TKey, TValue>())
            .ToArray();
    }

    public TValue this[TKey key]
    {
        get
        {
            int hash = key.GetHashCode();
            return dictionaries[GetBucket(hash)][key];
        }
        set
        {
            int hash = key.GetHashCode();
            dictionaries[GetBucket(hash)][key] = value;
        }
    }

    public bool Remove(TKey key)
    {
        int hash = key.GetHashCode();
        return dictionaries[GetBucket(hash)].Remove(key);
    }

    public void Clear()
    {
        foreach (var dic in dictionaries)
        {
            dic.Clear();
        }
    }

    // Maps a hash code to one of the inner dictionaries; the double modulo
    // keeps the result non-negative even for negative hash codes.
    private int GetBucket(int hash)
    {
        return (hash % numDictionaries + numDictionaries) % numDictionaries;
    }
}
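
Not part of the original answer, but a quick usage sketch, assuming the NonContigousDictionary class above is in scope: from the caller's point of view it behaves like an ordinary dictionary; only the internal storage is split into several smaller dictionaries.
using System;

class Program
{
    static void Main()
    {
        var map = new NonContigousDictionary<string, int>();
        map["alpha"] = 1;
        map["beta"] = 2;

        Console.WriteLine(map["alpha"]);       // prints 1
        Console.WriteLine(map.Remove("beta")); // prints True

        map.Clear();
    }
}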

Servy
-
Good stuff. The only comment is that I can't simply add more dictionaries -- I will have to rehash everything. Unless I implement something close to consistent hashing. – Schultz9999 Oct 12 '12 at 17:59
-
I think you want "hash % numDictionaries". Also, you're not accounting for possible negative hash codes. – Tim Mar 10 '14 at 21:13