
I'm processing 1 million records in my application, which I retrieve from a MySQL database. To do so I'm using LINQ to get the records, and .Skip() and .Take() to process 250 records at a time. For each retrieved record I need to create 0 to 4 Items, which I then add to the database. So the total number of Items that has to be created is around 2 million.

IQueryable<Object> objectCollection = dataContext.Repository<Object>();
int amountToSkip = 0;
var random = new Random();
IList<Object> objects = objectCollection.Skip(amountToSkip).Take(250).ToList();
while (objects.Count != 0)
{
    using (dataContext = new LinqToSqlContext(new DataContext()))
    {
        foreach (Object objectRecord in objects)
        {
            // Create 0 - 4 random Items (Next's upper bound is exclusive,
            // so Next(0, 5) yields 0 through 4; hoisted so the count isn't
            // re-rolled on every loop iteration)
            int itemCount = random.Next(0, 5);
            for (int i = 0; i < itemCount; i++)
            {
                Item item = new Item();
                item.Id = Guid.NewGuid();
                item.Object = objectRecord.Id;
                item.Created = DateTime.Now;
                item.Changed = DateTime.Now;
                dataContext.InsertOnSubmit(item);
            }
        }
        dataContext.SubmitChanges();
    }
    amountToSkip += 250;
    objects = objectCollection.Skip(amountToSkip).Take(250).ToList();
}

Now the problem arises when creating the Items. When running the application (even when not using the dataContext at all), memory usage increases steadily. It's as if the items are never released. Does anyone see what I'm doing wrong?

Thanks in advance!

Bas
  • How do you initialize your objectCollection? – Francisco Soto Apr 28 '10 at 07:49
  • IQueryable objectCollection = dataContext.Repository(); – Bas Apr 28 '10 at 07:55
  • Why do you take batches of 250 instead of just iterating over objectCollection? – Jens Apr 28 '10 at 07:59
  • 4
    Actually, his approach to pulling is nice. Just iterating would also involve pulling the data in chunks and how it's paged is up to the Linq implementation. Also you get the benefit of having to update exactly 250 items per run instead of some arbitrary number and relying on the framework to properly batch the stuff. – Tigraine Apr 28 '10 at 08:02
  • I would get a SQL timeout exception; it takes a lot of time to get 1 million records from the database that I'm using. – Bas Apr 28 '10 at 08:02
  • @Jens - committing 1 mil records in a single batch might not be so pretty. – devnull Apr 28 '10 at 08:02
  • Any chance you can run dotTrace or some other profiler over the code? That should show you pretty fast what the problem is. – Tigraine Apr 28 '10 at 08:05
  • Did you try transactioning the inserts? – Sorin Comanescu Apr 28 '10 at 08:24

5 Answers


OK, I've just discussed this situation with a colleague of mine and we've come to the following solution, which works!

const int processAmount = 250;
int amountToSkip = 0;
var finished = false;
var random = new Random();
while (!finished)
{
    using (var dataContext = new LinqToSqlContext(new DataContext()))
    {
        var objects = dataContext.Repository<Object>().Skip(amountToSkip).Take(processAmount).ToList();
        if (objects.Count == 0)
        {
            finished = true;
        }
        else
        {
            foreach (Object objectRecord in objects)
            {
                // Create 0 - 4 random Items (Next's upper bound is exclusive)
                int itemCount = random.Next(0, 5);
                for (int i = 0; i < itemCount; i++)
                {
                    Item item = new Item();
                    item.Id = Guid.NewGuid();
                    item.Object = objectRecord.Id;
                    item.Created = DateTime.Now;
                    item.Changed = DateTime.Now;
                    dataContext.InsertOnSubmit(item);
                }
            }
            dataContext.SubmitChanges();
        }
        // Advance amountToSkip by processAmount so we don't go over the same objects again
        amountToSkip += processAmount;
    }
}

With this implementation we dispose of the DataContext, and with it everything it cached for Skip() and Take(), on every batch, and thus don't leak memory!
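As a side note, LINQ to SQL can also be told not to cache what a query materializes at all, via DataContext.ObjectTrackingEnabled. What follows is a hedged sketch rather than part of the solution above: inserts still need a tracking context, so it only helps on the read side, and SomeEntity and connectionString are placeholder names.

// Sketch: a read-only context for the paging query plus a tracking
// context for the inserts. ObjectTrackingEnabled must be set before
// the first query runs; with it off, materialized rows are not retained.
using (var readContext = new DataContext(connectionString))
using (var writeContext = new DataContext(connectionString))
{
    readContext.ObjectTrackingEnabled = false;

    var page = readContext.GetTable<SomeEntity>()
                          .Skip(amountToSkip)
                          .Take(250)
                          .ToList();

    // ... create the Items on writeContext, then:
    writeContext.SubmitChanges();
}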

Bas

Ahhh, the good old InsertOnSubmit memory leak. I've encountered it and bashed my head against the wall many times when trying to load data from large CSV files using LINQ to SQL. The problem is that even after calling SubmitChanges, the DataContext continues to track all objects that have been added using InsertOnSubmit. The solution is to SubmitChanges after a certain number of objects, then create a new DataContext for the next batch. When the old DataContext is garbage collected, so will be all the inserted objects that it tracks (and that you no longer require).

"But wait!" you say, "Creating and disposing of many DataContext will have a huge overhead!". Well, not if you create a single database connection and pass it to each DataContext constructor. That way, a single connection to the database is maintained throughout, and the DataContext object is otherwise a lightweight object that represents a small work unit and should be discarded after it is complete (in your example, submitting a certain number of records).

Allon Guralnek
  • Ehh, I've said in my question that even when I'm NOT using the DataContext I'm getting this leak, so it's not bound to InsertOnSubmit or SubmitChanges (I already tested that). AND it's best practice to use the DataContext in a using block. DataContexts are lightweight and are meant to be recreated a lot (see: http://stackoverflow.com/questions/123057/how-do-i-avoid-a-memory-leak-with-linq-to-sql). I've already tried using 1 DataContext to do all transactions; that was even worse. – Bas Apr 28 '10 at 08:41
  • You're repeating what I said - use a new DataContext for each small work unit (inside a `using` statement, of course). And how exactly did you test your example without a `DataContext`? Where did you get the `objects` collection from? – Allon Guralnek Apr 28 '10 at 08:49
  • Sorry, I meant without the InsertOnSubmit and the SubmitChanges calls ;] My bad. At first I also thought the InsertOnSubmit and SubmitChanges were the problem, but after fixing this and doing a second run I still got the leak. The leak is because of the Skip and Take, which cache all the retrieved items and never dispose of them automatically while running. So eventually I had 2 million Items in a cached list. – Bas Apr 28 '10 at 08:56
  • Yes, that is the exact problem I described in my answer. BTW, in my case even after solving the memory leak, it still wasn't performant enough, so I wrote a CLR stored procedure in C#, which ran about 200 times faster (three minutes instead of ten hours loading 7.3 million records). – Allon Guralnek Apr 28 '10 at 08:56
  • Wow, OK, that sounds awesome ^^ I've solved the memory issue, but I have to agree with you, performance isn't something to really cheer about. Pretty nice that you could solve that by writing a stored procedure. – Bas Apr 28 '10 at 09:09

My best guess here would be that the IQueryable is causing the memory leak. Maybe there is no appropriate MySQL implementation of the Take/Skip methods and it's doing the paging in memory? Stranger things have happened, but your loop looks fine. All references should go out of scope and get garbage collected.
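One way to test that theory, assuming you can get at the raw System.Data.Linq.DataContext behind the LinqToSqlContext wrapper (innerContext below is a made-up name, as is SomeEntity), is to ask the provider for the SQL it would actually run:

// Hypothetical: innerContext is the raw System.Data.Linq.DataContext
// behind the question's repository wrapper.
var query = innerContext.GetTable<SomeEntity>().Skip(500).Take(250);

// GetCommand returns the ADO.NET command the provider would execute.
// A proper translation contains server-side paging; if the command
// text selects the whole table, Skip/Take are running in memory.
Console.WriteLine(innerContext.GetCommand(query).CommandText);

// Alternatively, log every statement the context executes:
innerContext.Log = Console.Out;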

Tigraine

Well yeah.

So at the end of that loop you'll attempt to have 2 million items in your list, no? Seems to me that the answer is trivial: store fewer items or get more memory.

-- Edit:

It's possible I've read it wrong; I'd need to compile and test it, but I can't do that now. I'll leave this here, but I could be wrong. I haven't reviewed it carefully enough to be definitive; nevertheless the answer may prove useful, or not. (Judging by the downvote, I guess not :P)

Noon Silk
  • No, he has a database with 2 million entries, takes them 250 at a time and then adds new sub-entries into the database. In memory there is no list of any kind. Read the question. – Tigraine Apr 28 '10 at 08:04
  • @Tigraine I think it's the skip that's screwing him. That's my guess, anyway. – Noon Silk Apr 28 '10 at 08:09
  • The cache, in the list objectCollection, is the bottleneck. It doesn't get disposed of automatically. – Bas Apr 28 '10 at 08:44

Have you tried declaring the Item outside the loop like this:

IQueryable<Object> objectCollection = dataContext.Repository<Object>();
int amountToSkip = 0;
var random = new Random();
IList<Object> objects = objectCollection.Skip(amountToSkip).Take(250).ToList();
Item item = null;
while (objects.Count != 0)
{
    using (dataContext = new LinqToSqlContext(new DataContext()))
    {
        foreach (Object objectRecord in objects)
        {
            // Create 0 - 4 random Items (Next's upper bound is exclusive)
            int itemCount = random.Next(0, 5);
            for (int i = 0; i < itemCount; i++)
            {
                item = new Item();
                item.Id = Guid.NewGuid();
                item.Object = objectRecord.Id;
                item.Created = DateTime.Now;
                item.Changed = DateTime.Now;
                dataContext.InsertOnSubmit(item);
            }
        }
        dataContext.SubmitChanges();
    }
    amountToSkip += 250;
    objects = objectCollection.Skip(amountToSkip).Take(250).ToList();
}
edosoft
  • Have tried this, didn't work >.< I think, like Tigraine said, it's because of the IQueryable and the Take/Skip... – Bas Apr 28 '10 at 08:06