5

I'm loading CSV data from files into a DataTable for processing.

The problem is that I want to process several files, and my tests with the DataTable show huge memory consumption. I tested with a 37 MB CSV file and the memory grew to 240 MB, which is way too much IMHO. I have read that there is overhead in a DataTable, and I could live with about 70 MB, but not 240 MB, which is six times the original size. I read here that DataTables need more memory than POCOs, but the difference is way too much.

I attached a memory profiler and checked whether I have memory leaks and where the memory goes. I found that the DataColumns each hold between 6 MB and 19 MB of strings, and the DataTable has about 20 columns. Are the values stored in the columns? Why is so much memory taken, and what can I do to reduce the memory consumption? With this memory consumption, DataTables seem to be unusable.

Has anybody else had such problems with DataTables, or am I doing something wrong?

PS: I tried a 70 MB file and the DataTable grew to 500 MB!

OK, here is a small test case: the 37 MB CSV file (21 columns) makes the memory grow to 179 MB.

    private static DataTable ReadCsv()
    {
        DataTable table = new DataTable();

        // suspend notifications, index maintenance and constraints while loading
        table.BeginLoadData();

        using (var reader = new StreamReader(File.OpenRead(@"C:\Develop\Tests\csv-Data\testdaten\test.csv")))
        {
            int y = 0;
            int columnsCount = 0;
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
                var values = line.Split(',');

                if (y == 0)
                {
                    // the first line contains the column names
                    columnsCount = values.Length;
                    for (int x = 0; x < columnsCount; x++)
                    {
                        table.Columns.Add(new DataColumn(values[x], typeof(string)));
                    }
                }
                else if (values.Length == columnsCount)
                {
                    // add the data row; lines with a different field count are skipped
                    table.Rows.Add(values);
                }

                y++;
            }

            table.EndLoadData();
            table.AcceptChanges();
        }

        return table;
    }
JohnnyBravo75
  • Do you want us to guess, or will you show some code for us to take a look at? – Ehsan Aug 26 '13 at 16:43
  • I guess there must be some other objects in your code? – King King Aug 26 '13 at 16:45
  • 1
    Memory is cheap but: Maybe this help you find a more memory efficient solution: http://stackoverflow.com/questions/275269/does-a-datatable-consume-more-memory-than-a-listt – dcaswell Aug 26 '13 at 16:45
  • 1
    @user814064 **Memory is cheap**? really? Today we currently use computers of `8GB RAM and more`, that's true but there are still many others using computers of `2GB RAM and less`. I don't think an application consuming `2xx` MB RAM is a trivial problem if the actual data is just `2x` MB. – King King Aug 26 '13 at 16:49
  • 1
    Yes really. Nobody here has started that they're are encountering any cost. I can use 256 megabytes of memory for a few seconds (or a few minutes) on my computer without seeing a huge cost. And I did provide a link to helpful post that explains why datatables have a large memory footprint. – dcaswell Aug 26 '13 at 17:06

2 Answers

6

DataSet and its children DataTable, DataRow, etc. make up an in-memory relational database. There is a lot of overhead involved (though it does make [some] things very convenient).

If memory is an issue,

  • Build domain objects to represent each row in your CSV file with typed properties.
  • Create a custom collection (or just use IList<T>) to hold them.
  • Alternatively, build a light-weight class with the basic semantics of a DataTable (a rough sketch follows this list):
    • the ability to select a row by number
    • the ability to select a column within a row by row number and either column name or ordinal
    • the ability to know the ordered set of column names
    • bonus: the ability to select a column by name or ordinal and receive a list of its values, one per row
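
A rough sketch of such a class (the name, layout, and API here are only illustrative assumptions; it needs `using System.Collections.Generic` and `using System.Linq`) could keep each row as a plain string[] and share a single name-to-ordinal map across all rows:

    public class LightweightTable
    {
        private readonly List<string[]> _rows = new List<string[]>();
        private readonly Dictionary<string, int> _ordinals;

        public LightweightTable(IEnumerable<string> columnNames)
        {
            ColumnNames = columnNames.ToList();
            _ordinals = ColumnNames
                .Select((name, index) => new { name, index })
                .ToDictionary(x => x.name, x => x.index);
        }

        // the ordered set of column names
        public IList<string> ColumnNames { get; private set; }

        public int RowCount { get { return _rows.Count; } }

        public void AddRow(string[] values) { _rows.Add(values); }

        // select a row by number
        public string[] GetRow(int rowIndex) { return _rows[rowIndex]; }

        // select a cell by row number and column ordinal or name
        public string GetValue(int rowIndex, int columnIndex) { return _rows[rowIndex][columnIndex]; }
        public string GetValue(int rowIndex, string columnName) { return _rows[rowIndex][_ordinals[columnName]]; }

        // bonus: a whole column as a list of values, one per row
        public List<string> GetColumn(string columnName)
        {
            int ordinal = _ordinals[columnName];
            return _rows.Select(row => row[ordinal]).ToList();
        }
    }

Since every cell is just a string in a plain array, the per-row cost is a single array allocation instead of a DataRow with its change-tracking and row-version state.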

Are you sure you need an in-memory representation of your CSV files? Could you access them via an IDataReader like Sebastien Lorion's Fast CSV Reader?
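
If row-at-a-time processing is enough, something along these lines (a minimal sketch; the file path and the per-row callback are placeholders) never keeps more than one row in memory:

    private static void ProcessCsv(string path, Action<string[]> processRow)
    {
        using (var reader = new StreamReader(File.OpenRead(path)))
        {
            // first line: column names
            string headerLine = reader.ReadLine();
            if (headerLine == null)
            {
                return;
            }
            int columnsCount = headerLine.Split(',').Length;

            while (!reader.EndOfStream)
            {
                var values = reader.ReadLine().Split(',');
                if (values.Length == columnsCount)
                {
                    // handle one row, then let it become garbage
                    processRow(values);
                }
            }
        }
    }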

Nicholas Carey
  • Yes, but a factor of 6-7 cannot be right. There must be a problem in my code, otherwise DataTables are unusable crap. That's why I asked for statistical/experienced values from other users to compare. No database has such an overhead. If I put the data into SQLite it does not get so big. I wrote my own CSV reader, AdoAdapter and XmlReader and designed the application to work with DataTables in memory, and with a calculation of factor two it would work. But this overhead will destroy my design. What is the experience of other users? – JohnnyBravo75 Aug 26 '13 at 22:25
0

DataTables are a generic solution for putting tabular data into memory and adding lots of table-related features. If the overhead is not acceptable for you, you have these options:

1. Write your own DataTable class that eliminates the overhead you don't need.
2. Use an alternate representation that still accomplishes what you need, perhaps POCO-based, or maybe an XmlDocument (may have just as much overhead, maybe more, never really worried about it).
3. Stop trying to load everything into memory and just bring data in as needed from your external store.
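
As a rough illustration of option 3 (this assumes the System.Data.SQLite package and needs `using System.Data`, `System.IO` and `System.Linq`; the table name `csvdata` is made up), the CSV could be streamed straight into a SQLite file and then sorted, indexed and queried there instead of being held in a DataTable:

    private static void LoadCsvIntoSqlite(string csvPath, string dbPath)
    {
        using (var reader = new StreamReader(File.OpenRead(csvPath)))
        using (var conn = new SQLiteConnection("Data Source=" + dbPath))
        {
            conn.Open();

            // first line: column names -> create a table with one TEXT column per CSV column
            string[] headers = reader.ReadLine().Split(',');
            string columnDefs = string.Join(", ", headers.Select(h => "[" + h + "] TEXT"));
            using (var create = new SQLiteCommand("CREATE TABLE csvdata (" + columnDefs + ")", conn))
            {
                create.ExecuteNonQuery();
            }

            // insert all rows inside one transaction to keep the bulk load fast
            string placeholders = string.Join(", ", headers.Select((h, i) => "@p" + i));
            using (var tx = conn.BeginTransaction())
            using (var insert = new SQLiteCommand(
                "INSERT INTO csvdata VALUES (" + placeholders + ")", conn, tx))
            {
                for (int i = 0; i < headers.Length; i++)
                {
                    insert.Parameters.Add("@p" + i, DbType.String);
                }

                while (!reader.EndOfStream)
                {
                    string[] values = reader.ReadLine().Split(',');
                    if (values.Length != headers.Length) continue;   // skip malformed lines

                    for (int i = 0; i < values.Length; i++)
                    {
                        insert.Parameters[i].Value = values[i];
                    }
                    insert.ExecuteNonQuery();
                }

                tx.Commit();
            }
        }
    }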

Gary Walker
  • 1. Already tried the Silverlight DataTable from Telerik (http://blogs.telerik.com/vladimirenchev/posts/09-04-23/lightweight-datatable-for-your-silverlight-applications.aspx); this is NOT lightweight, it consumes much more memory. The problem is that if you want edit capabilities it cannot be lightweight. Otherwise, using a list of dictionaries is more lightweight; this is my only hope. – JohnnyBravo75 Aug 26 '13 at 22:34
  • 2. POCO-based is not an alternative, because I have no fixed data: every data set has a different structure. 3. Yes, that's what I'm evaluating. The problem is that I need to process the data, build key columns/indexes, sort them... Perhaps a lightweight database like SQLite is the way to go. – JohnnyBravo75 Aug 26 '13 at 22:34
  • That sounds a lot like my option 3 -- there are a couple of database options. A lightweight one may be a good choice, or you might need something more scalable, but databases are pretty much one of the first things you should consider using when your info does not fit conveniently into memory or when you need a persistent store. It's what they do. If you make the change now, the upside is that as you continue to add data you won't just suddenly "hit the wall". – Gary Walker Aug 27 '13 at 13:57
  • Just a thought: did you call BeginLoadData() on your DataTable? I never compared the overhead personally, and I guess I just assumed you would have tried this. Don't forget EndLoadData(). – Gary Walker Aug 27 '13 at 14:02
  • Yes, I tried BeginLoadData() and EndLoadData(), but they did not make any difference. I showed sample code... – JohnnyBravo75 Aug 28 '13 at 16:41
  • One last thing: try calling the garbage collector after loading. Lots of unused temp objects might be the issue. Repeat the test, but call the GC after every 1000 rows and see if the results are the same -- this may be different, as the GC may not release memory back to the OS, but forced GC calls might prevent using the extra memory in the first place -- repeated calls to the GC are definitely not recommended in practice, but you would know if this is the real issue. – Gary Walker Aug 28 '13 at 17:20
  • Hi, also tried that, but the memory only gets released when the DataTable is cleared and disposed (clearing is needed; disposing alone doesn't help), then all memory gets freed. – JohnnyBravo75 Aug 28 '13 at 19:43