
I rarely turn here for help, but this is driving me crazy: I'm reading an xml file that wraps an arbitrary number of items, each with a b64-encoded file (and some accompanying metadata for it). Originally I just read the whole file into an XmlDocument, but while that was much cleaner code, I realized there's no limit on the size of the file, and XmlDocument eats a lot of memory and can run out if the file is large enough. So I rewrote the code to instead use XmlTextReader, which works great if the issue is that the program was sent an xml file with a large number of reasonably-sized attachments... but there's still a big problem, and that's where I turn to you:

If my xml reader is at a File element, that element contains a value that's enormous (say, 500MB), and I call reader.ReadElementContentAsString(), I now have a string that occupies 500MB (or possibly an OutOfMemoryException). What I would like to do in either case is just write to a log, "that file attachment was totally way too big, we're going to ignore it and move on", then move on to the next file. But it doesn't appear that the string I just tried to read is ever garbage collected, so what actually happens is the string takes up all the RAM, and every other file it tries to read after that also throws an OutOfMemoryException, even though most of the files will be quite small.
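To make that concrete, here's a minimal sketch of the kind of loop I mean (the element names and the cap are simplified stand-ins, not my real code):

```
using System.Xml;

const int maxAttachmentChars = 10 * 1024 * 1024; // stand-in cap, not a real value

using (var reader = new XmlTextReader("attachments.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "File")
        {
            // For a ~500MB value this allocates a ~1GB string (UTF-16)...
            // or throws OutOfMemoryException before it ever returns.
            string content = reader.ReadElementContentAsString();
            if (content.Length > maxAttachmentChars)
            {
                // Log "way too big" and move on... except the memory from
                // this allocation never seems to come back.
                continue;
            }
            // otherwise keep the attachment for later processing
        }
    }
}
```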

Recall: at this point, I'm reading the element's value into a local string, so I would have expected it would be eligible for garbage collection immediately (and that it would thus be garbage collected, at the latest, when the program attempts to read the next item and discovers it has no memory available). But I've tried everything, just in case: setting the string to null, explicitly calling GC.Collect()... no dice, Task Manager indicates the GC only collected about 40k of the ~500MB it just requested to store the string, and I still get out-of-memory exceptions attempting to read anything else.

There doesn't seem to be any way to know the length of the value contained in an xml element using XmlTextReader without reading that element, so I imagine I'm stuck reading the string... am I missing something, or is there really no way to read a giant value from an xml file without totally destroying your program's ability to do anything further afterwards? I'm going insane with this.

I have read a bit about C#'s GC, and the LOH, but nothing I read would have indicated to me that this would happen...

Let me know if you need any further information, and thanks!

edit: I did realize that the process was running as a 32-bit process, which meant it was being starved for memory a bit more than it should've been. Fixed that, this becomes less of an issue, but it is still behavior I'd like to fix. (It takes more and/or larger files to reach the point where an OutOfMemoryException is thrown, but once it is thrown, I still can't seem to reclaim that memory in a timely fashion.)

neminem
  • In Re to strings and garbage collection: http://stackoverflow.com/questions/2423111/strings-and-garbage-collection – Pete Garafano Apr 18 '13 at 15:48
  • I did see that question, but it was mostly talking about security, not about memory management (and interning; the string I'm reading is definitely not a literal, so it shouldn't have anything to do with interning...) – neminem Apr 18 '13 at 15:58
  • good call, I missed the literal. Are you storing this as part of a larger object or just a plain old string? – Pete Garafano Apr 18 '13 at 16:01
  • You could read it in as a stream, then call `.Length` on the stream to get the number of bytes. Then feed the stream to the `XmlTextReader` (I assume this has an overload that takes a stream). – Pete Garafano Apr 18 '13 at 16:06
  • Can you limit the scope of the variable reading the string? Does that help? – shahkalpesh Apr 18 '13 at 16:07
  • Originally I was storing it directly into a property of an object, but when I realized that could be an issue, as I said, I tried storing it directly into a local variable (so the scope is a string inside the loop, inside an `if (key == "File")`), and only saving it to the object to keep around for later if it wasn't too large. – neminem Apr 18 '13 at 16:10
  • @TheGreatCO As I said, the issue isn't the whole file being too large, just a single element value. I want to be able to throw out that element, but keep all the rest. As far as I know, you can't turn a single element from an XmlTextReader into a stream? – neminem Apr 18 '13 at 16:11
  • I once had a similar situation and ended up creating a custom finite-state machine that read the file char-by-char. This was for a flat file, not XML, so didn't have to deal with recursive data structures but, if you can be sure there are no recursive elements, there are many FSM code generators available for C#. – Dour High Arch Apr 18 '13 at 16:20
  • *There doesn't seem to be any way to know the length of the value contained in an xml element using XmlTextReader without reading that element* -- Well, yeah. It's kind of hard to know how large a string is before you read it from the disk. – Jim Mischel Apr 19 '13 at 23:35
  • -1 : "but there's still a big problem, and that's where I turn to you" "I'm going insane with this"... "that file attachment was totally way too big" etc. etc. etc. Since you appear to be an English speaker and this is a site for professionals, please try to write like one. This is not Facebook... – Vector Jul 10 '14 at 23:26

3 Answers


I had a similar issue with a SOAP service used to transfer large files as base64 strings.

I used XDocument instead of XmlDocument back then; that did the trick for me.

CSharpie
  • Looks like XDocument still reads the whole node into memory, so I'm not sure how that would really be better... it could still run out of memory reading a particularly large attachment, couldn't it? (And then you couldn't read the metadata to know which file failed, same issue as XmlDocument.) – neminem Apr 18 '13 at 17:23

You can use the XmlReader.ReadValueChunk method to read the contents of an element one "chunk" at a time instead of trying to read the whole content at once. That way you can, for example, decide at some point that the data is too large, then ignore it and log the event. A StringBuilder is probably the best way to combine the collected char-array chunks into one string.
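Something like this sketch (the buffer size, the cap, and the class/method names are my own choices, and the reader is assumed to be positioned on the start element whose value may be huge):

```
using System.Text;
using System.Xml;

static class ChunkedRead
{
    // Returns the element's text, or null if it exceeded maxChars.
    public static string ReadValueCapped(XmlReader reader, long maxChars)
    {
        reader.Read();                       // move into the element's text content
        var sb = new StringBuilder();
        var buffer = new char[4096];
        long total = 0;
        int count;
        while (reader.CanReadValueChunk &&
               (count = reader.ReadValueChunk(buffer, 0, buffer.Length)) > 0)
        {
            total += count;
            if (sb != null && total > maxChars)
                sb = null;                   // too big: drop what we have, keep draining
            if (sb != null)
                sb.Append(buffer, 0, count); // still under the cap: accumulate
        }
        return sb == null ? null : sb.ToString();
    }
}
```

The caller logs and skips when it gets null back; the largest single allocation along the way is the 4KB buffer plus whatever the StringBuilder accumulated before the cap was hit.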

If you want to release memory with GC.Collect(), you can force immediate finalizations and memory release with GC.WaitForPendingFinalizers(). This may affect performance (or even hang, see description behind the link), but you should get rid of the large objects assuming you don't have any live references to them anymore (i.e. the local variables are already out of scope or their value is set to null) and continue operations normally. You should of course use this as a last resort, when memory consumption is an issue and you really want to force getting rid of the excess memory allocations.
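In code, the combination looks like this, with largeString standing in for the last live reference to the big allocation:

```
largeString = null;            // make the huge string unreachable
GC.Collect();                  // full collection, large object heap included
GC.WaitForPendingFinalizers(); // block until the finalizer thread catches up
GC.Collect();                  // reclaim anything those finalizers released
```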

I have successfully used the GC.Collect(); GC.WaitForPendingFinalizers(); combination in a memory-sensitive environment to keep the memory footprint of an application well under 100MB, even when it reads through some really large XML files (>100MB). To improve performance I also used Process.PrivateMemorySize64 to track memory consumption and force finalizations only after a certain limit was reached. Before my improvements, memory consumption sometimes rose to over 1GB!
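A sketch of that thresholded approach (the class name and the 100MB limit are mine, just for illustration):

```
using System;
using System.Diagnostics;

static class MemoryGuard
{
    const long Limit = 100L * 1024 * 1024;             // illustrative threshold
    static readonly Process Current = Process.GetCurrentProcess();

    // Call this between items; it only forces a collection past the limit.
    public static void CollectIfOverLimit()
    {
        Current.Refresh();                             // cached values go stale; refresh first
        if (Current.PrivateMemorySize64 > Limit)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    }
}
```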

xjuice

I am not positive this is the case, but I think you need to dispose of the XmlTextReader. Save the path of the node after the excessively large node to a string, set your massive string to null, then dispose of the XmlTextReader and reopen it at the node after the large one. From what I understand, if you set your string to null or it goes out of scope, the GC should free that memory as soon as possible. It seems more likely to me that you're freeing the string but then continuing to do operations with the XmlTextReader, which is now holding onto a ton of memory.
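A rough sketch of that idea, assuming one top-level Item element per attachment (the element name, ProcessItem, and the skip counter are all made up for illustration):

```
using System;
using System.Xml;

static class SkippingReader
{
    public static void ReadAllItems(string path)
    {
        int skip = 0;                  // items already handled (or given up on)
        bool done = false;
        while (!done)
        {
            using (var reader = new XmlTextReader(path))
            {
                int seen = 0;
                try
                {
                    while (reader.ReadToFollowing("Item"))
                    {
                        seen++;
                        if (seen <= skip) continue;   // handled on an earlier pass
                        ProcessItem(reader);          // may throw OutOfMemoryException
                        skip = seen;
                    }
                    done = true;
                }
                catch (OutOfMemoryException)
                {
                    skip = seen;       // skip the item that blew up next time around
                    Console.Error.WriteLine("Attachment in item {0} too large; skipping.", seen);
                }
            }                          // reader disposed here, releasing whatever it held
        }
    }

    static void ProcessItem(XmlReader reader)
    {
        // placeholder: read the metadata and the (possibly huge) attachment
    }
}
```

The re-scan from the top is wasteful for big files, but it only happens on the failure path.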

Another idea that came to mind was to try doing this within an unsafe block and then freeing the memory explicitly; however, it doesn't look like that's possible (someone else might know better, but from looking around a bit, it seems memory used in an unsafe block is still GC'd; the block just gives you pointers). Yet another option, although IMO a terrible one, would be to write a parsing DLL in C or C++ and call it from your C# project.

Try the first suggestion before doing anything crazy like the last one :)

evanmcdonnal
  • That sounded so promising... but I closed the XmlTextReader, set it to null, and called GC.Collect, and it still indicated all my ram was taken. In any case, I'm not sure how you would be able to easily set a new XmlTextReader to the location of the old one, anyway? – neminem Apr 18 '13 at 17:22
  • Well, I think you're right in thinking that the XmlTextReader was holding things open: I tried instead using xmlReader.ReadSubtree() into a new XmlReader, reading the content using that, then disposing that temp reader, and *if* I didn't get an OutOfMemoryException from it, it successfully gave me back about *half* of the memory it requested... which is interesting, but still not terribly useful. (It does appear to be giving back the rest when the main XmlTextReader goes out of scope, but again, I'm not sure how useful that would be...) – neminem Apr 18 '13 at 18:49
  • @neminem is there a reason why you're doing this in C#? Maybe you should go the C/C++ route. They're obviously superior for these types of operations both with regard to performance and control. GC'd languages are just nice because you can handle the common cases with far less effort. – evanmcdonnal Apr 18 '13 at 21:14
  • This is one component of a larger project that is clearly meant to be written in a managed language. (Also because it's already all written, the issue with giant xml elements just came up in testing. Honestly, if I can't find a fix for it, it'll just be a known issue, "don't send us giant, multi-hundreds-of-megs attachments", and the client will be fine with that, they weren't actually planning on doing it anyway. I just like being robust in handling all potential inputs.) – neminem Apr 18 '13 at 21:24