
I have an application that reads 3-4 GB of data, builds an entity out of each line, and then stores them in Lists.

The problem I had is that memory grows insanely, to something like 13 to 15 GB. Why the heck does storing these entities take so much memory?

So I built a tree and did something similar to Huffman encoding, and the overall memory size dropped to around 200-300 MB.

I understand that I compacted the data, but I wasn't expecting that storing objects in a list would increase the memory so much. Why did that happen?

How about other data structures like dictionary, stack, queue, array, etc.?

Where can I find more information about the internals and memory allocations of data structures?

Or am I doing something wrong?

Brian Tompsett - 汤莱恩
DarthVader
  • Do you **really** need all that data (3-4 GB is a huge amount) in memory? Why not use a database or read only the required lines on demand? – Mayank Mar 25 '11 at 06:00
  • Irrelevant, that wasn't my question. But yes, I need every line, and the files are the database. If I do it line by line, I can't parallelize the program. – DarthVader Mar 25 '11 at 06:03
  • 1
    "or am i doing something wrong?" Yes you are. Good luck! – Mayank Mar 25 '11 at 06:05
  • Whenever you need to load 3-4GB of data into memory and have them processed as one whole lot (and you can't chop them up into pieces), then 99% of the time you are approaching the problem in the wrong way. – Stephen Chung Mar 25 '11 at 06:11
  • Why can't I chop them up? I can always partition a huge list, process the partitions, and merge the results, like the divide-and-conquer thing. – DarthVader Mar 25 '11 at 06:16
  • 1
  • Perhaps you could create a special structure for handling these sorts of data operations. A base level object tuned for managing operations like retrieval, sorting, concurrency, memory usage and so on. That base object for managing data might be called something like a "database." In fact, those operations are sufficiently complicated that some enterprising developers might make money selling a whole product that does just that. Bah...that's all crazy talk.... – Thomas Mar 25 '11 at 06:17
  • If you have asked this question because you are curious about Data Structures, then you might want to read [An Extensive Examination of Data Structures](http://msdn.microsoft.com/library/aa289148.aspx) – Mayank Mar 25 '11 at 06:25
  • I'm with @mayarik: if you have 3-4 GB of data and it occupies 13-15 GB, then you are doing something wrong. Given the level of detail in your question, that's all we can say about it, don't you think? – H H Mar 25 '11 at 07:11
  • What can I do wrong while I'm reading from a file and storing in a list? :) – DarthVader Mar 25 '11 at 13:35

3 Answers


In .NET, large objects go on the large object heap (LOH), which is not compacted. "Large" means anything above 85,000 bytes. As you grow your lists, their backing arrays will probably become larger than that and have to be reallocated once you cross the current capacity. Reallocation means they are very likely placed at the end of the heap, so you end up with a very fragmented LOH and a lot of memory usage.

Update: if you initialize your lists with the required capacity (which you can determine from the DB, I guess), then your memory consumption should go down a bit.
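A minimal sketch of what that could look like; the file name, the plain-string "entities", and the extra line-counting pass are assumptions for illustration, not part of this answer:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class PreSizedLoad
    {
        static void Main()
        {
            const string path = "data.txt";             // hypothetical input file

            // One extra pass to count lines, so the backing array is allocated
            // exactly once instead of being reallocated (and fragmenting the LOH)
            // every time the capacity is crossed.
            int lineCount = File.ReadLines(path).Count();

            var entities = new List<string>(lineCount); // capacity set up front
            foreach (var line in File.ReadLines(path))
                entities.Add(line);                     // no growth, no copying

            Console.WriteLine($"{entities.Count} lines, capacity {entities.Capacity}");
        }
    }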

ChrisWue

Regardless of the data structure you're going to use, your memory consumption is never going to drop below the memory required to store all your data.

Have you calculated how much memory is required to store one instance of your class?

Your Huffman encoding is a space-saving optimization: you are eliminating a lot of duplicated data within your class objects yourself. This has nothing to do with the data structure you use to hold your data; it depends on how the data itself is structured, so that you can take advantage of different space-saving strategies (of which Huffman encoding is one of many possibilities, suitable for eliminating common prefixes, with a tree as the data structure used to store it). A much simpler deduplication idea is sketched below.
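As a hedged illustration of that idea (independent of the container you use), here is a tiny string pool that stores each repeated field value once and shares it by reference; the class and its names are made up for this sketch and are not the asker's tree-based encoding:

    using System.Collections.Generic;

    // Sketch: identical strings (country names, status codes, ...) are kept once
    // and reused, instead of being duplicated on every parsed entity.
    class StringPool
    {
        private readonly Dictionary<string, string> _pool =
            new Dictionary<string, string>();

        public string Intern(string value)
        {
            if (_pool.TryGetValue(value, out var existing))
                return existing;   // reuse the instance we already hold
            _pool[value] = value;  // first occurrence: remember it
            return value;
        }
    }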

Now, back to your question. Without optimizing your data (i.e. the objects themselves), there are things you can watch out for to improve memory usage efficiency.

Are all your objects of similar size?

Did you simply run a loop, allocate objects on the fly, and insert them into a list, like this:

    foreach (var obj in collection) { myList.Add(new myObject(obj)); }

In that case, your list is constantly being expanded. Whenever you exceed the current capacity, .NET allocates a new, larger piece of memory and copies the original array into it. Essentially you end up with two pieces of memory: the original one, and the new expanded one (now holding the list). Do this many, many times (as you obviously need to for GBs of data) and you are looking at a LOT of fragmented memory. The small sketch below makes these capacity jumps visible.
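This sketch just watches List&lt;T&gt;.Capacity double as items are added, each jump corresponding to a new backing array plus a copy (the exact growth numbers depend on the runtime):

    using System;
    using System.Collections.Generic;

    class CapacityGrowth
    {
        static void Main()
        {
            var list = new List<int>();
            int lastCapacity = list.Capacity;

            for (int i = 0; i < 1000000; i++)
            {
                list.Add(i);
                if (list.Capacity != lastCapacity)
                {
                    // A capacity change means a new, larger array was allocated
                    // and the old elements were copied over.
                    Console.WriteLine($"count {list.Count,8}: capacity {lastCapacity} -> {list.Capacity}");
                    lastCapacity = list.Capacity;
                }
            }
        }
    }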

You'll be better off just allocating enough memory for the entire list at one go.

As an afternote, I can't help but wonder: how in the world are you going to search this HUGE list to find something you need? Shouldn't you be using something like a binary tree or a hash table to aid your searching? Maybe you are just reading in all the data, performing some processing on all of it, then writing it back out...

Stephen Chung
  • Lol, I'm not searching for anything in the list. Once I populate the data, I generate a report from it and aggregate the results. I want to read it all in at once so that I can actually process the data in parallel in memory. I understand your point. – DarthVader Mar 25 '11 at 06:14
  • If I process this line by line, I have to wait on IO, and it actually takes longer than reading the whole thing at once and then dividing and conquering. – DarthVader Mar 25 '11 at 06:15
  • Stephen's point is that when creating the List, you should estimate, or even better know, the number of lines so that you build the list with the right size right from the beginning. Determining the size will make the initial setup slower, but afterwards everything will be fast. – weismat Mar 25 '11 at 06:40
  • @user177883, in this case you might want to, say, import 1000 lines at a time. Process it. Then import the next 1000 lines. As long as you are not referring to lines all over the place and are just aggregating information there is no need to read the whole thing into memory. – Stephen Chung Mar 25 '11 at 06:45
  • @user177883, processing the file a block (say 1000 lines) at a time will also work much better when you parallelize the processing. If you have a system with 8 cores, say, and you queue ten million tasks to it, each core will have an internal queue of 1 million tasks -- consuming memory while waiting. If you chop it up into blocks, then each core will only have 100 tasks queued to it. You are *not* getting it done faster than your number of cores -- which is 8 records at one time. – Stephen Chung Mar 25 '11 at 06:51
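
A sketch of the block-at-a-time approach described in the comments above, assuming each line can be processed independently; the file name, chunk size, and the per-line "work" (just the line length here) are placeholders:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class ChunkedProcessing
    {
        static void Main()
        {
            const string path = "data.txt";   // hypothetical input file
            const int chunkSize = 1000;       // lines per block, tune as needed
            long total = 0;                   // some aggregate result

            var chunk = new List<string>(chunkSize);
            foreach (var line in File.ReadLines(path))  // streams the file lazily
            {
                chunk.Add(line);
                if (chunk.Count == chunkSize)
                {
                    total += ProcessChunk(chunk);
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0)
                total += ProcessChunk(chunk);           // leftover lines

            Console.WriteLine($"Aggregate: {total}");
        }

        // Process one block in parallel; the "work" here (line length) is a stand-in.
        static long ProcessChunk(List<string> lines)
        {
            return lines.AsParallel().Sum(l => (long)l.Length);
        }
    }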

If you are using classes, read the responses to this question: Understanding CLR object size between 32 bit vs 64 bit

On 64 bits (you are using 64 bits, right?), the object overhead is 16 bytes PLUS the reference to the object (someone is referencing it, right?), so another 8 bytes. An empty object will therefore "eat" at least 24 bytes.

If you are using Lists, remember that Lists grow by doubling, so you could be wasting a lot of space. Other .NET collections grow in the same way.

I'll add that the "pure" overhead of millions of Lists could bring the memory to its knees. On top of the 16 + 8 bytes of space "eaten" by the List object itself, a List is composed (in the .NET implementation) of 2 ints (8 bytes), a sync-root reference (8 bytes, normally null), and a reference to the internal array (so another 8 bytes, plus the array's 16-byte header, plus the array itself).
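If you want a ballpark figure for what your own objects and Lists cost at runtime, a crude before/after measurement with GC.GetTotalMemory is enough; the Entity class below is made up purely for the measurement:

    using System;
    using System.Collections.Generic;

    class MemoryEstimate
    {
        // Made-up entity, just to have something to measure.
        class Entity
        {
            public int Id;
            public string Name;
        }

        static void Main()
        {
            const int count = 1000000;

            long before = GC.GetTotalMemory(forceFullCollection: true);

            var list = new List<Entity>(count);
            for (int i = 0; i < count; i++)
                list.Add(new Entity { Id = i, Name = "x" });

            long after = GC.GetTotalMemory(forceFullCollection: true);

            // Rough average bytes per stored entity, including headers,
            // references, and the List's backing array.
            Console.WriteLine($"~{(after - before) / (double)count:F1} bytes per stored entity");

            GC.KeepAlive(list);   // keep the list alive until after the measurement
        }
    }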

xanatos