
What are the memory limitations of a HashSet<string> in C#?

I've seen that .NET has a memory limit of 2GB per object. Is this information still accurate? Does it apply to HashSets?

I'm currently working on an application that uses a large HashSet, and I've seen that once I build the DLLs for a 64-bit environment, I only get OutOfMemoryException when my 8GB RAM laptop reaches its memory limits.

If I had 16GB of RAM, would the object keep growing until it reaches the hardware limitations?

doremifasolasido
  • Duplicate? http://stackoverflow.com/a/1088044/993547 – Patrick Hofman Oct 31 '16 at 11:17
  • The 2GB limitation is for single objects and will impact the maximum size of arrays, for example. However, if the `T`s you've stored in a hashset are classes, then only a 32- or 64-bit reference is stored in the hashset; the actual object instance and its size won't matter in the context of the hashset. OutOfMemory in general means that .NET really ran out of memory; it should never mean that some arbitrary object decided that this is as high as it can go. – Lasse V. Karlsen Oct 31 '16 at 11:22
  • The 2GB limit is not that simple any more; there is a `gcAllowVeryLargeObjects` option - but the `int.MaxValue` limit still applies even if that is enabled; in the case of `T`=`string`, you might be able to get a bit larger, though - **if** `HashSet` is bounded by large arrays! Not trivial. – Marc Gravell Oct 31 '16 at 11:22
  • In other words, `HashSet` doesn't throw `OutOfMemoryException`, this is .NET throwing this exception when a `new X(...)` cannot be satisfied. – Lasse V. Karlsen Oct 31 '16 at 11:23
  • @LasseV.Karlsen it could legitimately be throwing because some backing internal structure can't be resized; I don't know how `HashSet` is implemented, but it is feasible – Marc Gravell Oct 31 '16 at 11:24
  • @MarcGravell The 2GB limit, for an array, on a 64-bit system - is this still "2GB / element-size" max number of elements? I see the `Slot[]` array inside `HashSet`, which contains two ints + `T`; wouldn't this mean that the maximum *number* of elements that can be stored in a `HashSet` is really just `2GB / (8 + sizeof(T))` if we disregard struct packing (`Slot` is a struct)? – Lasse V. Karlsen Oct 31 '16 at 11:24
  • @LasseV.Karlsen depends; if `gcAllowVeryLargeObjects` is enabled, it can be > 2GB – Marc Gravell Oct 31 '16 at 11:26
  • @MarcGravell Yes, obviously things you can do in `HashSet` can result in a `OutOfMemoryException`, but not because code in `HashSet` checked against some limit and then decided to throw that exception, .NET really ran out of memory. – Lasse V. Karlsen Oct 31 '16 at 11:26
  • OK, then I must retract some of my comments above, I need to go update my knowledge it seems, thanks @MarcGravell. – Lasse V. Karlsen Oct 31 '16 at 11:27
  • I was kinda wondering about this comment in the documentation: "For very large `HashSet` objects, you can increase the maximum capacity to 2 billion elements on a 64-bit system by setting the enabled attribute of the configuration element to true in the run-time environment.". Kinda silly of them to not really write *which* setting to enable? – Lasse V. Karlsen Oct 31 '16 at 11:27
  • I retract all my comments but will leave them be to avoid causing comment strangeness. I see now that trying to do `new T[too-large-for-2gb]` also throws `OutOfMemoryException` even though memory-wise such an array has more than enough memory available in a 64-bit process. – Lasse V. Karlsen Oct 31 '16 at 11:34
  • Pretty hard to envision a scenario where you run out of HashSet before you run out of address space to store the strings. You'd have to store small strings, pretty hard to keep them unique. If the average string length is 10 chars then you'd need 10 gigabytes for just the strings. Very rough on the GC btw. Available address space in a 64-bit program is limited by the maximum pagefile size and how fast it can grow. – Hans Passant Oct 31 '16 at 11:34
  • From the answer below I understand everything apart from "Also note that it's possible to configure an application to allow arrays larger than 2GB in size - although you can still only have 2GB of elements in a single dimensional array. " ??? – doremifasolasido Oct 31 '16 at 12:15
  • @LaRage What I meant by that is that you can't have more than 2G elements in an array, but because each element can be several bytes, the total size in bytes can exceed 2GB. – Matthew Watson Oct 31 '16 at 12:36
  • @HansPassant Another thing ... so in my case the majority of the RAM is taken by the strings themselves, not the HashSet object, which keeps only the references? Is it correct to say that if I were to save the strings in a DB and map them with NHibernate, I could keep more strings in my object, but would have to take into account the R/W DB overhead? – doremifasolasido Oct 31 '16 at 14:39
  • The disk is *always* the workaround for programs that need excessive amounts of memory. Whether that is the paging file or a flat file or a dbase doesn't actually make any difference. – Hans Passant Oct 31 '16 at 14:45

1 Answer


There is a 2GB limit per object, but remember that a reference type only uses the pointer size (8 bytes for x64) when it's a field in a class.

Array memory sizes are computed as follows, ignoring fixed per-array overhead (a short sketch follows the list):

For arrays of struct types:

  • Array memory size = #elements in the array * size of each element

For arrays of reference types:

  • Array memory size = #elements in the array * reference size (4 bytes for x86, 8 bytes for x64)
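
As a rough illustration of those formulas (a minimal sketch; the CLR also adds a small fixed header and length field to every array, ignored here):

using System;

class ArraySizeSketch
{
    static void Main()
    {
        // Struct elements are stored inline: #elements * element size.
        long intArrayBytes = 250000000L * sizeof(int);      // 250M ints ~ 1GB

        // Reference elements store only the reference: #elements * IntPtr.Size
        // (4 on x86, 8 on x64). The string instances themselves live elsewhere
        // on the heap and don't count towards the array's own size.
        long stringArrayBytes = 250000000L * IntPtr.Size;   // ~2GB of references on x64

        Console.WriteLine(intArrayBytes);     // 1000000000
        Console.WriteLine(stringArrayBytes);  // 2000000000 on x64
    }
}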

So a HashSet could reference objects totalling a lot more than the 2GB limit. It's just that if you add up the size taken by each field in the class - 64 bits for reference types, and the full size for struct types - it must be less than 2GB.

You could have a class that contained 16x1GB arrays of bytes, for instance.
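
A minimal sketch of that idea (a hypothetical class, just for illustration; actually constructing it needs ~16GB of available memory, but each array stays under the 2GB per-object limit):

class SixteenGigabytes
{
    // 16 separate 1GB arrays: each array object is well under 2GB, and
    // the class instance itself stores only 16 references (~128 bytes on x64).
    private readonly byte[][] chunks = new byte[16][];

    public SixteenGigabytes()
    {
        for (int i = 0; i < chunks.Length; i++)
            chunks[i] = new byte[1024 * 1024 * 1024];   // 1GB each
    }
}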

Also note that it's possible to configure an application to allow arrays larger than 2GB in size - although the maximum number of elements in a single dimensional array still cannot exceed 2G (2*1024*1024*1024).
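
For reference, this is the `gcAllowVeryLargeObjects` element mentioned in the comments above; a minimal example of it in the application's app.config:

<configuration>
  <runtime>
    <!-- Allows individual arrays larger than 2GB total size on 64-bit
         platforms; the ~2^31 limit on the number of elements per array
         still remains. -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>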

I suspect that the objects that you are storing in the HashSet are reference types, so it's only using 64 bits for each one in the internal HashSet array, while the full size of each of your objects is much larger than 64 bits - which gives a total size in excess of 2GB.

Looking at the reference source for HashSet shows that the following arrays are used:

private int[] m_buckets;   // per-bucket head: 1-based index into m_slots, 0 = empty
private Slot[] m_slots;    // the entries themselves, chained via Slot.next

Where Slot is defined like so:

internal struct Slot {
    internal int hashCode;      // Lower 31 bits of hash code, -1 if unused
    internal T value;
    internal int next;          // Index of next entry, -1 if last
}

It looks like each Slot struct occupies 24 bytes on x64 when T is a reference type, which means that HashSet will throw OutOfMemoryException when the number of slots in use exceeds 2GB / 24 ≈ 89M elements.

(If T is a struct then depending on its size you'll run out of memory a lot sooner.)
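
A quick back-of-envelope check of that figure (a sketch assuming the 24-byte Slot layout above; the exact failure point will differ slightly because HashSet grows its arrays in prime-sized steps):

using System;

class SlotMath
{
    static void Main()
    {
        const long maxObjectBytes = 2L * 1024 * 1024 * 1024;  // 2GB per-object limit

        // 4 (hashCode) + 4 (padding) + 8 (reference to T) + 4 (next) + 4 (padding)
        const int slotSizeX64 = 24;

        Console.WriteLine(maxObjectBytes / slotSizeX64);      // 89478485, i.e. ~89M slots
    }
}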

Matthew Watson
  • @LaRage Since you are storing strings, you would be able to store ~128*1024*1024 of them before getting OutOfMemory. If you enable large object support, I *think* you'd be able to store ~2*1024*1024*1024 of them before getting an exception due to having too many elements. But you would need to test this out to be definite! I don't really know for sure. – Matthew Watson Oct 31 '16 at 12:34
  • Pardon my ignorance, but on x64, should the Slot struct not take 24 bytes? Unlike classes, structs are not packed/reordered for compactness afaik. – mafu Sep 08 '19 at 18:57
  • @mafu You're right - it should be 24 bytes. I'll update the answer. – Matthew Watson Sep 09 '19 at 09:06
  • By the way, this is ridiculously low and is 100% Microsoft's fault for having another "640K ought to be enough for anybody" moment with this – A X Sep 27 '20 at 20:18