10

I have a large string (e.g. 20 MB).

I am now parsing this string. The problem is that strings in C# are immutable; this means that once I've created a substring and looked at it, that memory is wasted.

Because of all the processing, memory gets clogged up with String objects that I no longer use, need, or reference; but it takes the garbage collector too long to free them.

So the application runs out of memory.

I could take the blunt, poorly performing approach and sprinkle a few thousand calls to:

```csharp
GC.Collect();
```

everywhere, but that's not really solving the issue.

I know StringBuilder exists for building up a large string.

I know TextReader exists for reading a string into a char array.

I need to somehow "reuse" a string, making it no longer immutable, so that I don't needlessly allocate gigabytes of memory when 1 KB will do.
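To illustrate, here is a minimal sketch of the kind of parse loop that generates this garbage (`LoadMht` and `Process` are hypothetical stand-ins for our actual code):

```csharp
string document = LoadMht();   // hypothetical loader; returns a ~20 MB string
int pos = 0;
while (pos < document.Length)
{
    int end = document.IndexOf("\r\n", pos);
    if (end < 0) break;

    // Substring copies the characters: a brand-new allocation on every pass.
    string line = document.Substring(pos, end - pos);
    Process(line);             // hypothetical consumer
    pos = end + 2;             // 'line' is now garbage, waiting on the collector
}
```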

Ian Boyd
  • You can't, unless you pin it and go unsafe (you can modify the buffer directly with unsafe code). I think you might need to go with a stream and read only little bits at a time – Andras Zoltan Sep 06 '11 at 19:17
  • Depending on what you're doing with the data, it might make sense to implement your own "string" class, where substrings are actually references into the parent string (akin to what Java does with `substring`). That way only the original string data is stored in memory. You may want to see this post as well: http://stackoverflow.com/questions/6742923/if-strings-are-immutable-in-net-then-why-does-substring-take-on-time/6750591#6750591 – dlev Sep 06 '11 at 19:21
  • http://msdn.microsoft.com/en-us/magazine/cc534993.aspx – Hans Passant Sep 06 '11 at 19:23
  • I'd guess you still have references that you aren't aware of. – David Heffernan Sep 06 '11 at 19:24
  • If you're making a ten million character string, odds are good you're doing something wrong. Why do you have a string this big in memory in the first place? Do you need to have the whole thing in memory to parse it? Parsers typically consume strings in a forwards-only fashion with limited look-ahead; why do you need the whole string in memory at once? – Eric Lippert Sep 06 '11 at 20:02
  • @Eric Lippert: We're processing `MHT` files, each one held as a string in memory. `MHT` is a single-file web-page; a customer uses this as their transportable representation of a person. It contains base64-encoded images large enough for facial recognition. It is conceivable that where the database is on a hard-drive or CD we could use a `StreamReader` (and a `StringReader` when fetching them into memory from a web-site) - but then we have to process it as an array of `Char`. Doing that we lose all the useful methods `String` gives us (StartsWith, SubString, IndexOf). Plus it's already written. – Ian Boyd Sep 06 '11 at 20:40
  • A string is not a "persistent" data structure; that is, it was not designed to make efficient re-use of memory when altered. You are almost certainly running into all kinds of problems as a result. (See http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strings-immutability-and-persistence.aspx for some more analysis.) I would be building a better abstraction here than a string if I were you; I'd be using a stream-based approach that turns a stream of characters into a stream of tokens, and then turns that stream of tokens into a stream of parsed nodes. – Eric Lippert Sep 06 '11 at 21:20
  • Alternatively, I would consider building a struct that efficiently represents a substring of an existing string, rather than actually building the substring. That is, make your own persistent data structure that is layered on top of an existing string, rather than making a big copy of the string whenever you do some nonpersistent operation on it. Those are the sorts of abstractions we build in the compiler to deal with things like multi-megabyte files of source code that we need to parse (see the sketch below). – Eric Lippert Sep 06 '11 at 21:22
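For what it's worth, a minimal sketch of the struct Eric describes might look like this (names hypothetical; error checking omitted):

```csharp
// A "slice" is just (source, offset, length); slicing never copies characters.
public struct StringSlice
{
    private readonly string _source;
    private readonly int _offset;
    private readonly int _length;

    public StringSlice(string source, int offset, int length)
    {
        _source = source;
        _offset = offset;
        _length = length;
    }

    public int Length { get { return _length; } }

    public char this[int index]
    {
        get { return _source[_offset + index]; }
    }

    // Returns a new slice over the same backing string - a small struct copy,
    // not a copy of the characters.
    public StringSlice Substring(int start, int length)
    {
        return new StringSlice(_source, _offset + start, length);
    }

    public bool StartsWith(string value)
    {
        if (value.Length > _length) return false;
        for (int i = 0; i < value.Length; i++)
            if (_source[_offset + i] != value[i]) return false;
        return true;
    }

    // Materialize a real string only at the very end, when one is truly needed.
    public override string ToString()
    {
        return _source.Substring(_offset, _length);
    }
}
```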

4 Answers

12

If your application is dying, that's likely to be because you still have references to strings - not because the garbage collector is just failing to clean them up. I have seen it fail like that, but it's pretty unlikely. Have you used a profiler to check that you really do have a lot of strings in memory at a time?

The long and the short of it is that you can't reuse a string to store different data - it just can't be done. You can write your own equivalent if you like - but the chances of doing that efficiently and correctly are pretty slim. Now if you could give more information about what you're doing, we may be able to suggest alternative approaches that don't use so much memory.

Jon Skeet
  • It's likely that Jon is correct and that you're holding some kind of reference to the strings, therefore blocking the clean up. However, if this isn't the case and you have to reuse the memory for your strings, you may consider using 'Unsafe' code, but only as a last resort. You can find more details on this here http://msdn.microsoft.com/en-us/library/aa288474%28v=vs.71%29.aspx. – Jeff Reddy Sep 06 '11 at 19:23
  • We're not holding a reference, per se; if we force the garbage collector to run the memory is released. At some level you can existentially argue that we're holding the memory, since it *is* allocated in my process space. There's nothing *stopping* the GC from freeing it - except that it doesn't run fast enough. – Ian Boyd Sep 06 '11 at 20:26
  • @Ian: That's at least unusual. How large are the substrings? What's the source of the original large string, and could you stream it (e.g. work one line at a time)? – Jon Skeet Sep 06 '11 at 20:28
  • They're `MHT` files; single-file encoded web-pages; a customer uses them as a serialization format for information about people (e.g. FBI most wanted). It contains base-64 encoded images, large enough to be suitable for facial recognition. When needed we process a few hundred thousand. Sometimes the `mht` files will be on a hard-drive or CD; but they can also come in from our `WebRequest`. Each one is loaded into memory (as a string) and processed. – Ian Boyd Sep 06 '11 at 20:59
  • @Ian: Could you stream them, perhaps, rather than loading the whole page into memory in one go? – Jon Skeet Sep 06 '11 at 21:00
  • I'm a little confused. I thought the runtime always fired off a full GC, including all generations and the LOH, prior to throwing an `OutOfMemoryException`. Is that not the case? Can an OOM get thrown when there are still large amounts of unreferenced memory in the LOH and SOH? – Greg D Sep 07 '11 at 12:53
  • Ah- with the exception being a fragmented LOH, I suppose. That could do it. Especially if every string getting alloc'd is a little larger than the last one. – Greg D Sep 07 '11 at 12:59
  • @Jon: We *could* stream them, but it would have to be into something other than `Strings`, since we'd be back where we were. It seems to me that the *only* way to handle it is either to `.Read` into a `Char` array (and write our own `Char[]` string library), or to call `GC.Collect()` periodically. The downside of the former is that it requires a complete re-write of (logically) correct code, and means creating our own (almost certainly buggy) string library. The downside of the latter is that it feels like a hack. – Ian Boyd Sep 07 '11 at 17:09
  • @Ian: Well, you could read a line at a time using `TextReader.ReadLine` if that would be simple enough to process - that would create far more strings, but they'd be *small* strings which would presumably never make it past gen0, and *wouldn't* end up on the large object heap (see the sketch below this thread). – Jon Skeet Sep 07 '11 at 17:11
  • I was going to ask what "small" is, but I see @Neil Fenwick's link that says about `85,000` bytes. In my case that's a bit under `42,000` characters. Accepted; I may not like it, but such is life with a garbage collector. – Ian Boyd Sep 07 '11 at 19:30
  • @Ian: Yes, that's right - although typically a single line would be rather less than that, of course :) You could read buffers of 40000 characters at a time though, if that would help... – Jon Skeet Sep 07 '11 at 19:32
  • The symptoms of fragmented free space in the Large Object Heap sound like an exact match for our situation. If possible, limiting objects to the small object heap will hopefully help the GC keep up. – Ian Boyd Sep 07 '11 at 19:37
  • @Ian: I sympathise with the unusual situation you're in. It always feels *unlikely* for any one developer to find the GC at fault, but it does sound like you're in that minority :( Basically, calling GC.Collect should never be what makes an app work, just like calling Thread.Sleep is never a nice solution for threading issues :( – Jon Skeet Sep 07 '11 at 19:39
  • Or `GetMessage`-`TranslateMessage`-`DispatchMessage` – Ian Boyd Sep 07 '11 at 20:01
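For reference, a minimal sketch of the line-at-a-time approach Jon describes, assuming the content really can be processed one line at a time (`path` and `ProcessLine` are hypothetical):

```csharp
using System.IO;

static void ProcessFile(string path)
{
    using (TextReader reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Each line is a small gen0 string that never touches the
            // large object heap and is collectable as soon as we move on.
            ProcessLine(line);
        }
    }
}
```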
3

This question is almost 10 years old. These days, look at `ReadOnlySpan<char>`: create one from the string using the `AsSpan()` extension method. You can then slice it to get substrings as spans without allocating any new strings.
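A minimal example of what that looks like on modern .NET (the header string here is purely illustrative):

```csharp
using System;

string document = "Content-Type: text/html";
ReadOnlySpan<char> span = document.AsSpan();

// Slicing a span is just offset/length arithmetic - no characters are copied
// and no new strings are allocated.
ReadOnlySpan<char> name  = span.Slice(0, 12);  // "Content-Type"
ReadOnlySpan<char> value = span.Slice(14);     // "text/html"

bool isHtml = value.StartsWith("text/html".AsSpan(), StringComparison.Ordinal);
```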

Mr. TA
2

Considering that you cannot reuse strings in C#, I would suggest using Memory-Mapped Files. You save the string to disk once and then process it through the mapped file as a stream, which gives an excellent performance/memory-consumption trade-off. You reuse the same file and the same stream, operate only on the smallest possible portion of the data as a string - the part you need at that precise moment - and then immediately throw it away.

Whether this fits depends strictly on your project requirements, but I think it is an option worth serious consideration: memory consumption will go down dramatically, although you will "pay" something in terms of performance.
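A sketch of what that could look like, assuming the MHT file is already on disk (the path is hypothetical):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static void ProcessMappedFile()
{
    using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data\person.mht", FileMode.Open))
    using (var stream = mmf.CreateViewStream())
    using (var reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Only this small window of the file is ever materialized as a
            // string; the OS pages the rest of the file in and out as needed.
        }
    }
}
```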

Tigran
  • Even if the file was memory mapped, we still have the issue where we read parts into `Strings` - sometimes large parts, sometimes small parts, sometimes smaller parts from the larger parts. Eventually all these uncollected Strings choke off free memory - or cause a swapping death. – Ian Boyd Sep 07 '11 at 17:11
  • @Ian In my view, frequent but short-term use of relatively small strings - since you do not need to load all the data into memory in this case - should make a significant difference. – Tigran Sep 07 '11 at 18:37
1

Do you have some sample code to test whether possible solutions would work well?

In general though, any object bigger than about 85,000 bytes is allocated on the Large Object Heap, which is only collected during full (gen 2) collections - so much less often.

Also, if you're really pushing the CPU hard, the garbage collector will likely perform its work less often, trying to stay out of your way.
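As an illustration, a sketch that keeps every allocation comfortably under that threshold by reusing a single buffer (`path` and `ProcessChunk` are hypothetical):

```csharp
using System.IO;

static void ProcessInChunks(string path)
{
    // 40,000 chars = 80,000 bytes: safely below the ~85,000-byte LOH threshold.
    char[] buffer = new char[40000];   // allocated once, reused for every read

    using (TextReader reader = new StreamReader(path))
    {
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            ProcessChunk(buffer, read);   // hypothetical consumer of each chunk
        }
    }
}
```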

Neil Fenwick