3

I have about 100,000 sentences in a List<string>.

I'm trying to split each of these sentences into words and add everything to a List<List<string>>, where each inner List represents one sentence as a list of its words. I'm doing that because I have to do different work on each individual word. What would be the size difference in memory between a List<string> of sentences and a List<List<string>> of words?

One of these will be stored in memory eventually, so I'm looking for the memory impact of splitting each sentence versus keeping it as a single string.
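
For reference, this is roughly what the split looks like (assuming words are separated by spaces; my real tokenization may differ):

// sentences is the existing List<string> with ~100,000 sentences.
var sentenceWords = new List<List<string>>(sentences.Count);

foreach (string sentence in sentences)
{
    // Assuming space-separated words; adjust the separators as needed.
    string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    sentenceWords.Add(new List<string>(words));
}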

iefpw
  • Why theorize? Why not try both and look at what happens to the total memory of your program? If your total data set isn't *really* large, chances are that whatever it is, it's low enough not to be a problem. If not, it's still easy enough to find out. – Servy Apr 25 '13 at 19:43
  • http://stackoverflow.com/questions/16214954/how-much-ram-c-sharp-application-can-use – Roar Apr 25 '13 at 19:45
  • While you could just try it, in practice I'm not sure what would happen with how different words and sentences will be stored in the intern pool – Yuriy Faktorovich Apr 25 '13 at 19:47
  • @YuriyFaktorovich Only compile time constants are interned, so it wouldn't be affected by interning at all unless you explicitly intern them. – Servy Apr 25 '13 at 19:52
  • @Servy that's what I thought, but I think at one point I proved that wasn't always true. – Yuriy Faktorovich Apr 25 '13 at 20:13
  • @YuriyFaktorovich Possibly, although you might just be thinking of Java, which I believe is more aggressive in its interning. – Servy Apr 25 '13 at 20:14
  • @Servy definitely not. I played around quite a bit with how strings are interned. But I might remember wrong. – Yuriy Faktorovich Apr 25 '13 at 20:16
  • @Servy I was just playing around some more and found the following `var n = "a"; ReferenceEquals(n + "b", String.IsInterned(n + "b"));//false` but `ReferenceEquals("a" + "b", String.IsInterned("a" + "b"));//true` so I might have just made a mistake in my original tests – Yuriy Faktorovich Apr 25 '13 at 20:23
  • 100K sentences must contain a lot of repeated words. You will be able to save a lot of memory by reusing a single instance of each word wherever it appears. – Casperah Apr 25 '13 at 20:36
  • @Casperah but that memory savings isn't free, it comes at the cost of a lot of checking against the intern pool, and it also means that when he's done handling all of this data the string data will stay around until the entire application ends, even though it's not being used, which could be a real problem. Remember, memory is cheap. If you've got several gigs to spare (most computers these days do) then a few extra dozen MB just isn't a problem. – Servy Apr 25 '13 at 20:39
  • To eliminate duplicate storage of words you may want to use a single-instance string store ( http://www.codeproject.com/Articles/38057/Single-Instance-String-Store-for-NET ) to de-duplicate words after splitting them out (a rough sketch follows below). This should save quite a bit with little additional overhead and has the advantage of allowing garbage collection (whereas string interning can only be freed by unloading the AppDomain). – Rob Parker Apr 25 '13 at 21:38
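
For illustration, a minimal sketch of that de-duplication idea using a plain Dictionary<string, string> as the single-instance store (the linked article's class works along the same lines, with more features):

// Reuses one string instance per distinct word. Unlike interning,
// the dictionary and its strings can be garbage collected once you drop the reference.
private static readonly Dictionary<string, string> wordStore =
    new Dictionary<string, string>();

private static string Deduplicate(string word)
{
    string existing;
    if (wordStore.TryGetValue(word, out existing))
        return existing;      // reuse the instance we already stored
    wordStore[word] = word;   // first occurrence of this word
    return word;
}

Each word produced by the split would be passed through Deduplicate before being added to its sentence's list.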

3 Answers

4

First off, we'll compare the difference in memory between a single string and two strings which, if concatenated together, would produce the first:

string first = "ab";

string second = "a";
string third = "b";

How much memory does first use compared to second and third together? Well, the actual character data they need to reference is the same, but every single string object has a small overhead (14 bytes on a 32 bit system, 26 bytes on a 64 bit system).

So for each sentence that you break up into a List<string> of smaller strings there is a 14 * (wordsPerSentence - 1) byte overhead.

Then there is the overhead for the list itself. The list will consume one word of memory (32 bits on a 32 bit system, 64 on a 64 bit system, etc.) for each item added to the list plus the overhead of a List<string> itself (which is 24 bytes on a 32 bit system).

So for that you need to add (on a 32 bit system) (24 + (4 * averageWordsPerSentence)) * numberOfSentences bytes of memory.
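
Putting the two formulas together, a rough sketch of the estimate (using the 32-bit figures from above; averageWordsPerSentence is a placeholder you'd measure from your own data):

// Rough 32-bit figures from this answer; estimates, not exact CLR numbers.
const int stringOverhead = 14;          // extra bytes per string object
const int listOverhead = 24;            // bytes per List<string> instance
const int referenceSize = 4;            // one word on a 32-bit system

const int numberOfSentences = 100000;   // from the question
const int averageWordsPerSentence = 10; // placeholder - measure your own data

// Extra string-object overhead from splitting each sentence into words.
long extraStringBytes = (long)stringOverhead * (averageWordsPerSentence - 1) * numberOfSentences;

// Overhead of the per-sentence lists plus their references.
long listBytes = (long)(listOverhead + referenceSize * averageWordsPerSentence) * numberOfSentences;

Console.WriteLine("Roughly {0} MB of additional overhead",
    (extraStringBytes + listBytes) / (1024 * 1024));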

Servy
  • The overhead of a `List` is the allocation overhead, plus memory for `Count` and `Capacity`, plus array allocation overhead (on the order of 50 bytes: normal 24 byte allocation overhead, plus metadata). – Jim Mischel Apr 25 '13 at 20:00
4

We'll start with your List<string>. I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.

The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.

The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 186 bytes per string. Multiplied by 100K strings, that's about 18.5 megabytes. Altogether, your list of sentences will take around 20 MB (round number).

If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array, plus 26 bytes per string (total of about 260 bytes), plus the string data itself (an average of 8 characters per word, or 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.

So, very rough numbers, splitting your sentences into a List<List<string>> will occupy 55 or 60 megabytes.
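
If you want to rerun that arithmetic with your own averages, here is the same back-of-envelope calculation as a sketch (the 26/50/32/8-byte figures and the 80-character, 10-word, 8-character averages are just the assumptions used above):

// Rough 64-bit figures used above; estimates, not exact CLR numbers.
const int stringOverhead = 26;   // bytes per string object
const int arrayOverhead = 50;    // bytes per backing array
const int listOverhead = 32;     // bytes per List instance
const int referenceSize = 8;     // bytes per reference

const int sentenceCount = 100000;
const int charsPerSentence = 80; // assumed average
const int wordsPerSentence = 10; // assumed average
const int charsPerWord = 8;      // assumed average

// List<string> of whole sentences.
long sentenceListBytes = listOverhead + arrayOverhead
    + (long)sentenceCount * (referenceSize + stringOverhead + 2 * charsPerSentence);

// List<List<string>> of words: one inner list per sentence.
long perSentenceBytes = listOverhead + arrayOverhead
    + wordsPerSentence * (referenceSize + stringOverhead + 2 * charsPerWord);
long wordListsBytes = listOverhead + arrayOverhead
    + (long)sentenceCount * (referenceSize + perSentenceBytes);

Console.WriteLine("Sentences only: ~{0} MB; split into words: ~{1} MB",
    sentenceListBytes / (1024 * 1024), wordListsBytes / (1024 * 1024));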

Jim Mischel
1

Unfortunately, this isn't a question that can be answered very easily -- it depends on the particular strings, and what lengths you're willing to go to in order to optimize.

For example, take a look at the String.Intern() method. If you intern all the words, it's possible that the collection of words will require less memory than the collection of sentences. It would depend on the contents. There are other implications to interning, though, so that might not be the best idea. Again, it would depend on the particulars of the situation -- check the "Performance Considerations" section of the doc page I linked.
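
For illustration, interning during the split would look something like this (sentences and sentenceWords are the collections from the question; whether it actually saves memory depends on how much repetition there is, per the caveats above):

foreach (string sentence in sentences)
{
    var words = new List<string>();
    foreach (string word in sentence.Split(' '))
        words.Add(string.Intern(word));   // duplicates now all share one instance
    sentenceWords.Add(words);
}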

I think the best thing to do is to use GC.GetTotalMemory(true) before and after your operation to get a rough idea of how much memory is actually being used.
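
A rough sketch of that measurement (SplitIntoWords here stands for whatever splitting code you end up using; the forced collection makes the numbers reasonably stable, but they're still approximate):

long before = GC.GetTotalMemory(true);   // force a collection, then read

List<List<string>> sentenceWords = SplitIntoWords(sentences);   // your splitting code

long after = GC.GetTotalMemory(true);
Console.WriteLine("Approximate additional memory: {0} MB", (after - before) / (1024 * 1024));

GC.KeepAlive(sentenceWords);   // keep the result alive past the second measurement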

Jeremy Todd
  • Interning will result in equal or less memory, possibly a lot less, possibly just a bit less, but not more. However, it comes at the cost of CPU cycles. So if you're willing to use less memory at the cost of running longer, it's worth interning. – Servy Apr 25 '13 at 20:08
  • Well, in the degenerate case where every word appears exactly once, it could result in using more memory than the list of sentences did, due to the overhead from all the extra string instances, null terminators, additional `List`s and whatnot, but yeah, in general it would probably save space. But then there's also the issue that that memory won't be reclaimed even after the string instances are collected, too. – Jeremy Todd Apr 25 '13 at 21:04