4

I have a 150 MB file. Each line follows the same format, e.g.:

I,h,q,q,3,A,5,Q,3,[,5,Q,8,c,3,N,3,E,4,F,4,g,4,I,V,9000,0000001-100,G9999999990001800000000000001,G9999999990000001100PDNELKKMMCNELRQNWJ010, , , , , , ,D,Z,

I have a Dictionary<string, List<string>>

It is populated by opening the file, reading each line, taking elements from the line and adding them to the dictionary; then the file is closed.

    StreamReader s = File.OpenText(file);
    string lineData = null;
    while ((lineData = s.ReadLine()) != null)
    {
        var elements = lineData.Split(',');
        var compareElements = elements.Take(24);
        FileData.Add(elements[27], new List<string>(compareElements));
    }
    s.Close();

Using the method in this answer I calculated my dictionary to be 600 MB. That's four times the size of the file.
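
Roughly, the measurement looks like this (a sketch of the before/after TotalMemory approach; not my exact code):

    long before = GC.GetTotalMemory(false);   // snapshot before the loop
    // ... the loop above that fills FileData ...
    long after = GC.GetTotalMemory(false);    // snapshot after the loop
    Console.WriteLine("{0:N2} MB", (after - before) / (1024.0 * 1024.0));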

Does that sound correct?

Jon
  • What is your file's encoding? – Paolo Moretti Nov 09 '11 at 13:18
    While not directly related to your problem, you should make sure to dispose of anything that implements IDisposable. In your case, instead of just calling s.Close() you should call s.Dispose() or wrap the StreamReader in a using-block. – Kevin Wienhold Nov 09 '11 at 13:19
  • Thanks Kevin, I assume the same applies to StreamWriter – Jon Nov 09 '11 at 13:21
  • @Jon: Yes, it also applies to StreamWriter. One of my most desired features for Visual Studio in the future is some form of highlighting for types that implement IDisposable; so far, your best bet is to see whether a given class has a Dispose() method. – Kevin Wienhold Nov 09 '11 at 13:33
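
Following Kevin's suggestion, the loop wrapped in a using block (which calls Dispose automatically) would look like this - same logic as above, only the resource handling changes:

    using (StreamReader s = File.OpenText(file))
    {
        string lineData;
        while ((lineData = s.ReadLine()) != null)
        {
            var elements = lineData.Split(',');
            var compareElements = elements.Take(24);
            FileData.Add(elements[27], new List<string>(compareElements));
        }
    } // Dispose() is called here, even if an exception was thrown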

6 Answers

3

Apart from the fact that the method is not very reliable, in your case there is an even greater overhead. Notice that each iteration of your loop creates a new instance of the elements array and the lineData string, and elements.Take also allocates some internal objects on each call. Since you probably have plenty of RAM, the .NET garbage collector doesn't bother collecting them, so when you measure TotalMemory before and after the loop you are also measuring all of that garbage, not just your dictionary, even though the dictionary might be the only thing still in scope afterwards.
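
If you want the measurement to exclude that garbage, one option (just a sketch; it still isn't a precise sizing tool) is to force a full collection before taking the "after" snapshot:

    // `before` captured the same way before the loop
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    long after = GC.GetTotalMemory(true); // true forces another full collection
    Console.WriteLine("{0:N0} bytes retained", after - before);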

Vilx-
  • Good point about elements. That can be moved out of the while loop. What would you do with lineData? – Jon Nov 09 '11 at 13:31
  • What? No, you can't move it out of the loop! Each call to `Split()` produces a new array; it doesn't matter that you assign it to the same variable. The old array doesn't get thrown out of memory until the Garbage Collector decides it's time to do some cleaning up. The same goes for `lineData` - each call to `ReadLine()` produces a new string instance. And there's no way around it. – Vilx- Nov 09 '11 at 13:35
  • There is _little_ provable relation between the size of BinaryFormatter output and the size when stored on the heap. If anything, I wouldn't be surprised if binary serialization of `List<string>` turned out an order of magnitude more compact than the in-memory representation. – sehe Nov 09 '11 at 13:37
  • @Vilx- the garbage collector _will bother/doesn't have to bother_, [that's what generation 0 heap is for](http://msdn.microsoft.com/en-us/library/ms973837.aspx). – sehe Nov 09 '11 at 13:38
  • OK, removed the remark about BinaryFormatter. True. Anyways, GC won't bother until you start running out of RAM, or it gets bored, or whatever. But _most likely_ it won't start collecting stuff right after you leave your loop. – Vilx- Nov 09 '11 at 13:40
  • @Vilx-: I've had this little **[C# tight loop (IL code shown) with fast, small allocations (Gen0)](http://ideone.com/zZHuS)** running for over 29 minutes now, and it never grew in size. I've got 8 GB of RAM (64-bit OS) of which >50% is free. I don't know the exact mechanics of .NET garbage collection, but it seems safe to assume it doesn't wait to get bored before optimizing Gen0 allocations. The loop is tightly 100% CPU-bound of course. – sehe Nov 09 '11 at 14:29
  • Curious. Well, I guess they've made the GC smarter than I know of. :P Still, I would not rely on this when trying to measure things. For all we know there might be other hidden factors in play (like .NET version, or maybe free CPU cores or something) which can differ on your machine and OP's machine. – Vilx- Nov 09 '11 at 15:02
1

Yes, because you are turning single characters into strings, and each string reference (pointer) alone takes 4 or 8 bytes.
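
You can check the reference size for your process with a one-liner (just a quick sketch):

    // prints 4 in a 32-bit process, 8 in a 64-bit process
    Console.WriteLine("Reference size: " + IntPtr.Size + " bytes");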

Tudor
1

I assume your file is encoded as UTF-8 and contains mostly ASCII characters. Strings in C# are UTF-16 though, so that explains most of the size difference right there (a factor of 2). There is also some overhead for the data structures, of course.
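
You can see that factor directly by comparing byte counts for one of your lines (a quick sketch; the literal is a shortened stand-in for a real line):

    string line = "I,h,q,q,3,A,5,Q,3";                               // stand-in for one line of the file
    int onDisk = System.Text.Encoding.UTF8.GetByteCount(line);      // 1 byte per char for ASCII-only text
    int inMemory = System.Text.Encoding.Unicode.GetByteCount(line); // UTF-16: 2 bytes per char
    Console.WriteLine("{0} vs {1} bytes", onDisk, inMemory);        // roughly 1 : 2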

BrokenGlass
1

Most of these fields hold only a single character, yet you are storing them as strings. The reference to each of those strings alone takes at least twice as much space as the character itself (with UTF-8 on disk, likely 4-8 times as much). Then there is the overhead of the hash table structure that backs the dictionary.

The List<> itself should be quite efficient storage-wise (it uses an array internally).

Room for improvement:

  • you could use List<char> or char[] instead of List<string> if you know that each field fits in a single character (see the sketch after this list)
  • you could use struct Field { char a, b /*,...*/; } and List<Field> instead of List<string> if you need more than one character per field
  • You could forgo the eager field extraction [<-- recommended]:

     var dict = File.ReadAllLines(file)
          .ToDictionary(line => line.Split(',')[27]);
    

    This gives you the opportunity to access the compareElements on demand:

    string[] compareElements = dict["key27"].Split(',')/*.Take(24).ToArray()*/;
    

    This is a classic example of runtime/storage cost trade-off
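
For the first bullet, a sketch of what the loading loop could look like with List<char> (assuming each of the 24 fields really is a single character):

    var fileData = new Dictionary<string, List<char>>();
    using (var s = File.OpenText(file))
    {
        string lineData;
        while ((lineData = s.ReadLine()) != null)
        {
            var elements = lineData.Split(',');
            // char is a value type, so List<char> stores the characters inline
            fileData.Add(elements[27], elements.Take(24).Select(e => e[0]).ToList());
        }
    }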

Edit: an obvious hybrid would be:

    struct AllCompareElements
    {
        public char field1, field2, /* ... */ field24;
        // perhaps, for the exceptional field that is longer than 1 character:
        public char[] field13;
    }

Happily employ Resharper to implement Equals, GetHashCode, IEquatable<AllCompareElements>, IComparable<AllCompareElements>

sehe
  • If I used char[] I would have to remove the commas. Not a problem, just saying, but it's an overhead and string.Split is doing it for me already – Jon Nov 09 '11 at 13:28
  • @Jon: You missed the point. I meant store the result of the split in a List<char>, not a List<string>: `line.Split(',').Take(24).Select(s => s[0]);`. Not only is a char usually smaller than a string; most importantly, char is a `ValueType`, string is a `reference type`. Especially when storing value types in a generic `List<>`, the storage requirements get drastically optimized. – sehe Nov 09 '11 at 13:31
  • I'll look into that. I don't see the benefit of doing the ReadAllLines as I only need to store the 24 in the dictionary so your example would have me then find the item in the dictionary and then replace the contents – Jon Nov 09 '11 at 13:39
  • @Jon: huh? ReadAllLines is just _shorter_ than your own while loop. Note that you take the first 24 'compare elements' (comma-delimited fields out of the split), not the first 24 lines of the input file? – sehe Nov 09 '11 at 13:52
  • The reason I do a Take(24) is because I use the dictionary in another part of the program and only need those 24 elements – Jon Nov 09 '11 at 14:00
  • @Jon I guessed that :) Your remark `... doing the ReadAllLines as I only need to store the 24 ...` confused me. Also, the point is that you might get better storage efficiency. Of course it depends on how much data comes after those 24 fields, but there are ways in between. Main point: a single string will have better storage efficiency than a `List<string>` with 24 elements. That's a fact. – sehe Nov 09 '11 at 14:21
  • Using List<char> has helped a lot! Same code as in my question, just replaced List<string> with List<char> – Jon Nov 09 '11 at 17:11
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/4843/discussion-between-sehe-and-jon) – sehe Nov 09 '11 at 18:13
1

If your file is encoded in ANSI or UTF-8 (with no special characters the size is the same as ANSI: one byte per character), whereas a .NET string "Represents text as a series of Unicode characters" (UTF-16, two bytes per character), then the character data alone already doubles in memory; the rest of the difference is object and data-structure overhead.

Michał Powaga
0

That 600 MB is what was allocated by the operation of loading the file into the dictionary. It suggests it's an expensive operation, and it could be useful in gauging how effective any optimisation might be, but as a measure of how much memory the dictionary itself takes up, it's fairly useless.

I'd defer the splitting as suggested by sehe straight off.

Looks to me like you've prematurely optimised for speed, and that's cost you big style on memory footprint.

Tony Hopkinson