I am using LINQ to SQL. Here is the code:

Dictionary<string, int> allResults;
using (var dc = new MyDataContext())
{
    dc.CommandTimeout = 0;
    allResults = dc.MyTable.ToDictionary(x => x.Text, x => x.Id);
}

It is run on a 64-bit machine and the compilation is AnyCPU. It throws a System.OutOfMemoryException.

This accesses an SQL Server database. The Id field maps to an SQL Server int column, and the Text field maps to an nvarchar(max) column. Running select COUNT(*) from TableName returns 1,173,623 records, and running select sum(len(Text)) from TableName returns 48,915,031. Since int is a 32-bit integer, the ids should take only 4.69MB of space and the strings less than 1GB. So we are not even bumping against the 2GB/object limit.
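
Here is the back-of-envelope arithmetic behind that claim (a rough sketch; the per-object overheads below are typical x64 estimates, not measurements):

// Assumptions: x64 process, 2 bytes per UTF-16 char, ~26 bytes of
// per-string object overhead, ~24 bytes per Dictionary<string, int> entry.
const long rows = 1173623;        // select COUNT(*) from TableName
const long totalChars = 48915031; // select sum(len(Text)) from TableName

long stringBytes = totalChars * 2 + rows * 26; // char payload + string headers
long entryBytes = rows * 24;                   // key ref + value + hashCode + next

// Prints roughly 150 MB - far below 2GB, and no single internal array
// should come close to the per-object limit either.
Console.WriteLine((stringBytes + entryBytes) / (1024 * 1024) + " MB");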

I then changed the code in this way:

Dictionary<string, int> allResults;
using (var dc = new MyDataContext())
{
    Dictionary<string, int> threeHundred;
    dc.CommandTimeout = 0;
    var tot = dc.MyTable.Count();
    allResults = new Dictionary<string, int>(tot);
    int skip = 0;
    int takeThis = 300000;
    while (skip < tot)
    {
        threeHundred = dc.MyTable.Skip(skip).Take(takeThis).ToDictionary(x => x.Text, x => x.Id);
        skip = skip + takeThis;
        allResults = allResults.Concat(threeHundred).ToDictionary(x => x.Key, x => x.Value);
        threeHundred = null;
        GC.Collect();
    }
}

I learned that garbage collection here does not help, and that the OutOfMemoryException is thrown on the first line in the while loop once skip = 900,000.

What is wrong and how do I fix this?

Barka
  • Do you actually have enough memory to hold all of those at once? – SLaks Apr 28 '14 at 18:29
  • Garbage collection cannot possibly help when you're trying to hold everything at once. Your second code block simply creates extra objects, giving the GC more work to do, without making the actual dictionary any smaller. – SLaks Apr 28 '14 at 18:30
  • How many rows are in your database table? – Martin Costello Apr 28 '14 at 18:31
  • @SLaks, thanks. That makes sense. The point with the garbage collection is to de-allocate the memory threeHundred took, and the point of the revised code was to find at what point in memory usage the exception is thrown. – Barka Apr 28 '14 at 18:34
  • @SLaks, I have plenty of RAM on the machine that is never used, and of course there is virtual memory also. – Barka Apr 28 '14 at 18:35
  • @martin_costello, 1.17 million – Barka Apr 28 '14 at 18:36
  • @user277498 That doesn't sound *too* unreasonable, but it's still quite a lot to operate on at once. Have you tried creating the dictionary yourself and adding the items into it one by one by just using a `foreach` loop and only selecting the database columns you need? For example `foreach (var item in dc.MyTable.Select((p) => new { Text = p.Text, Id = p.Id })) { ... }`. – Martin Costello Apr 28 '14 at 18:40
  • Side note: Make sure your process runs as x64 - by default AnyCPU runs x86 ("prefer 32-bit"). – Alexei Levenkov Apr 28 '14 at 18:40
  • Does this apply? http://stackoverflow.com/questions/934314/maximum-memory-a-net-process-can-allocate – lcryder Apr 28 '14 at 18:42
  • If you're using .net 4.0 or lower, I think there's a memory cap at 2 gigabytes per object, regardless of the architecture you're targeting. I'll see if I can't dig out a link... – sous2817 Apr 28 '14 at 18:43
  • There is a 2GB-per-object limit: http://stackoverflow.com/questions/2415434/the-limitation-on-the-size-of-net-array – Joe Apr 28 '14 at 18:44
  • Do you really want the text as the key and not the id? Why do you need to hold everything in memory and why do you need a dictionary? – Magnus Apr 28 '14 at 18:52
  • @martin_costello, I am taking only two columns from the database. A text column and an int. And the text column while variable in size is not that big. Doesn't C# convert the query I wrote to what you suggest at compile time anyway? – Barka Apr 28 '14 at 22:54
  • @AlexeiLevenkov, not according to: http://msdn.microsoft.com/en-us/library/zekwfyz4.aspx. However, I will try as you suggest. – Barka Apr 28 '14 at 22:58
  • @sous2817, I am using .NET 4.5 – Barka Apr 28 '14 at 22:59
  • @Joe, See my note to sous2817 above. However, you two might be on to something. I think I need to change my config file: by default, the .NET Framework does not support arrays that are greater than 2 GB, but in the .NET Framework 4.5 you can use the <gcAllowVeryLargeObjects> element in your application configuration file to enable arrays that are greater than this size. Source: http://blogs.msdn.com/b/dotnet/archive/2012/07/20/the-net-framework-4-5-includes-new-garbage-collector-enhancements-for-client-and-server-apps.aspx – Barka Apr 28 '14 at 23:07
  • @Magnus, I want to quickly look up the id given the text without implementing Redis or hitting SQL Server over and over again. Dictionary allows that with O(1) lookups. – Barka Apr 28 '14 at 23:55
  • You can add a full text index on the column in the database and look it up there instead. It might be an option. – Magnus Apr 29 '14 at 06:55
  • I'm really surprised nobody said... since `Text` is `nvarchar(max)`, have you seen how much data is in this field for each row? A single row could be 2GB! – Erik Philips Apr 29 '14 at 23:01
  • @ErikPhilips, I added that later and the folks who commented before did not see it. However, read the rest of that paragraph; according to my calculations, the whole thing should be under 1GB: select sum(len(Text)) from TableName results in 48,915,031. Unless I am doing my math wrong. – Barka Apr 30 '14 at 17:17
  • @Magnus, access to in memory collections is always faster than access to any external data source no matter how well indexed it is. My interest is in figuring out what goes wrong here and how to fix it more than re-engineering the product to avoid this error. – Barka Apr 30 '14 at 17:19
  • The 2GB/object does not apply here anyway, the Dictionary is a big cluster of objects, each well below that limit. A memory profiler might help. – H H Apr 30 '14 at 17:42

1 Answer


Without getting into your calculations of how much memory it should take (there could be encoding issues that easily double the size of the data), I'll try to give a few pointers.

Starting with the cause of the issue - my guess is that the threeHundred dictionary is causing a lot of allocations. When you add items to a dictionary as above, it cannot know how many items it will eventually hold, so it repeatedly re-allocates its internal arrays and copies all of the data into the newly created storage. Set a capacity (using the ctor) on the threeHundred dictionary before adding any items to it.
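
For example, a minimal sketch against the loop in the question (note that ToDictionary, and Concat(...).ToDictionary(...), each build a fresh, unsized dictionary on every iteration):

// Hypothetical rework of the loop body: add each page of rows straight
// into the pre-sized allResults instead of building and merging
// intermediate dictionaries.
foreach (var row in dc.MyTable.Skip(skip).Take(takeThis)
                      .Select(x => new { x.Text, x.Id }))
{
    allResults[row.Text] = row.Id;
}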

Please read this article I've published, which goes in depth into Dictionary internals - I'm sure it will shed some light on those symptoms: http://www.codeproject.com/Articles/500644/Understanding-Generic-Dictionary-in-depth

In addition, when populating this large amount of data, I suggest fully controlling the process. My suggestions:

  • Pre-allocate slots in the Dictionary (using a Count query directly on the DB, and passing the result to the Dictionary ctor).
  • Work with a DataReader to populate those items without loading the entire query result into memory (see the sketch after this list). If you know for a fact that there are many duplicated strings - and it is VERY important to know this in advance - consider string.Intern, but test it to see how it behaves.
  • Memory-profile the code - you should see only ONE allocation for the Dictionary itself, plus one string per item in the query (int is a value type, so it is not allocated on the heap as an object; it sits inside the Dictionary's entries).
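
For the DataReader route, here is a minimal sketch (the connection string parameter and the MyTable/Id/Text names are assumptions taken from the question, and Text is assumed non-null):

using System.Collections.Generic;
using System.Data.SqlClient;

static Dictionary<string, int> LoadAllResults(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // One round-trip for the count, so the dictionary can be sized once.
        int count;
        using (var countCmd = new SqlCommand("SELECT COUNT(*) FROM MyTable", conn))
        {
            count = (int)countCmd.ExecuteScalar();
        }

        var results = new Dictionary<string, int>(count);

        using (var cmd = new SqlCommand("SELECT Id, [Text] FROM MyTable", conn))
        {
            cmd.CommandTimeout = 0;
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Streams one row at a time; only the dictionary grows.
                    results[reader.GetString(1)] = reader.GetInt32(0);
                }
            }
        }

        return results;
    }
}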

Either way, you should check whether you are actually running as 32-bit or 64-bit: AnyCPU projects targeting .NET 4.5 default to "Prefer 32-bit", which runs as a 32-bit process. (Check it in Task Manager or the project's Build properties.)
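
A quick way to verify at runtime (Environment.Is64BitProcess and Is64BitOperatingSystem exist since .NET 4.0):

// If this prints False on a 64-bit OS, the process is running as 32-bit
// (e.g. AnyCPU with "Prefer 32-bit" checked) and its address space is
// capped far below the machine's physical RAM.
Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);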

Hope this helps, Ofir.

Ofir Makmal