
I inherited a Windows service that accepts requests via remoting to analyze potentially huge amounts of data in a database. It does this by loading the raw data into a DataSet, parsing it into a tree-like structure of objects, and then running the analysis.

The problem I am encountering is that when a very large set of data is analyzed, not all of the memory is returned to the system after the analysis is done, even if I aggressively force garbage collection. For example, if a 500MB set of data is analyzed, the Windows service goes from ~10MB (the baseline at startup) up to ~500MB, then down to ~200MB after GC.Collect(), and never goes any lower, even overnight. The only way to get the memory back down is to stop and restart the service. If I run a small analysis, however, the service goes from ~10MB up to ~50MB, then down to something like ~20MB. Not great either, but there is a huge discrepancy between the final utilization for large versus small data after the analysis is done.

This is not a memory leak per se because if I run the large analysis over and over, the total memory goes back down to ~200MB every time it completes.

This is a problem because the Windows service runs on a shared server, and I can't have my process taking up loads of memory all the time. It's fine if it spikes and then drops back down when the analysis is done, but this spikes and only partially drops, settling at an unacceptably high number. A typical scenario is running an analysis and then sitting idle for hours.

Unfortunately, this codebase is very large, and a huge portion of it is coded to work with the DataTables returned by a proprietary data access layer, so using an alternate method to load the data is not an option (I wish I could; loading all the data into memory just to loop over it makes no sense).

So my questions are:

1) Why does running a large dataset cause the memory utilization to settle back down to ~200MB, while running the small dataset causes it to settle back down to ~20MB? It's obviously hanging on to pieces of the dataset somehow; I just can't see where.

2) Why does it make a difference if I loop over the data table's rows or not (see below)?

3) How can I get/force the memory back down to reasonable levels when the analysis is done, without radically changing the architecture?

I created a small Windows service/client app to reproduce the problem. The test database I am using has a table with a million records, an int PK, and two string fields. Here are the scenarios I have tried -- the client (a console app) calls LoadData via remoting ten times in a loop.

1) doWork = true, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB then stabilizes at 22MB.

2) doWork = false, garbageCollect = true, recordCount = 100,000. Memory goes up to 78MB and stabilizes at 19MB. Seriously, scenario 1 needs 3MB more just to loop over the rows without doing anything?

3) doWork = false, garbageCollect = false, recordCount = 100,000. Memory goes up to about 178MB then stabilizes at 78MB. Forcing garbage collection is obviously doing something, but not enough for my needs.

4) doWork = false, garbageCollect = true, recordCount = 1,000,000. Memory goes up to 500MB and stabilizes at 35MB. Why does it stabilize at a higher number when the dataset is larger?

5) doWork = false, garbageCollect = true, recordCount = 1,000. It runs too fast to see the peak but it stabilizes at a measly 12MB.

public string LoadData(bool doWork, bool garbageCollect, int recordCount)
{
    var dataSet = new DataSet();

    using (var sqlConnection = new SqlConnection("...blah..."))
    {
        sqlConnection.Open();

        using (var dbCommand = sqlConnection.CreateCommand())
        {
            dbCommand.CommandText = string.Format("select top {0} * from dbo.FakeData", recordCount);
            dbCommand.CommandType = CommandType.Text;

            using (var dataAdapter = new SqlDataAdapter(dbCommand))
            {
                dataAdapter.Fill(dataSet);
            }
        }

    }

    // loop over the records

    var count = dataSet.Tables[0].Rows.Count;

    if (doWork)
    {
        foreach (DataRow row in dataSet.Tables[0].Rows) {}
    }

    dataSet.Clear();
    dataSet = null;

    if (garbageCollect)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    return string.Format("Record count is {0}", count);
}
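One thing worth noting when reading the numbers above: the size of the managed heap and the process working set are different measurements and can diverge, because the CLR may keep freed segments reserved rather than returning them to Windows. A small sketch (the helper name is my own) that logs both side by side, which could be called before and after the forced collection in LoadData:

```csharp
using System;
using System.Diagnostics;

static class MemoryReport
{
    // Logs the managed heap size (what the GC is tracking) next to the
    // process working set (roughly what Task Manager shows). The two can
    // differ because the CLR may hold on to committed segments for reuse.
    public static void Log(string label)
    {
        // forceFullCollection: true blocks until a full collection has run
        long managedBytes = GC.GetTotalMemory(forceFullCollection: true);
        long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine("{0}: managed heap = {1:N0} MB, working set = {2:N0} MB",
            label, managedBytes / (1024 * 1024), workingSetBytes / (1024 * 1024));
    }
}
```

For example, calling MemoryReport.Log("after GC") at the end of LoadData would show whether the lingering ~200MB is actually live managed objects or just memory the runtime has not yet released back to the OS.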
Paul Abbott
  • Have you read this thread: http://stackoverflow.com/questions/1343374/reducing-memory-usage-of-net-applications ? What value are you using to determine your application is using xxxMB of memory? Are you using a memory profiling tool? – John Koerner Feb 15 '14 at 04:47
  • Also, the accepted answer in this thread has some good insights: http://stackoverflow.com/questions/223283/net-exe-memory-footprint – John Koerner Feb 15 '14 at 04:48
  • OK, so it sounds like I'm worrying over nothing. – Paul Abbott Feb 15 '14 at 16:50

0 Answers