5

I've got a few huge XML files, 1+ GB each. I need to do some filtering operations on them. The easiest idea I've come up with is to save them as txt, read them with ReadAllText, and start doing operations like

  var a = File.ReadAllText("file path");
  a = a.Replace("<", "\r\n<");

The moment I try that, however, the program crashes with an out-of-memory error. Watching Task Manager while it runs, I see the RAM usage climb to about 50%, and the moment it reaches that point the program dies.

Does anyone have any ideas on how I can work with these files while avoiding the OutOfMemoryException, or how to let the program use more memory?

cybera
    Use streams, not strings. – Wai Ha Lee Nov 01 '17 at 07:34
  • Is the replacement the “filtering”, or is it something else? Take a look at [`XmlReader`](https://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.110).aspx), anyway. (I think that’s the right one.) – Ry- Nov 01 '17 at 07:35
  • In general, try to *avoid* treating XML as "just strings". Use tools *designed* for working with XML as much as possible, unless what you're trying to produce isn't XML but is "something that looks like XML but I'm doing odd things to it such that it isn't technically XML any more" – Damien_The_Unbeliever Nov 01 '17 at 07:37
  • Read this thread, it seems similar to your question: https://stackoverflow.com/questions/15772031/how-to-parse-very-huge-xml-files-in-c – Dhaval Pankhaniya Nov 01 '17 at 07:39
  • My end result is trying to compare the data in each element from one xml file to the data in the element with the same unique ID in another xml file. The problem is that there is no guarantee of any sort what the data in an element will be: 5 tags in one element, 150 tags in the next. Because of this I moved away from using the xml library; I just couldn't figure out how to make this check with it. – cybera Nov 01 '17 at 07:44
  • Your `ReadAllText` reads in one copy of your entire file, then that `Replace` creates a *second* copy. – Hans Kesting Nov 01 '17 at 07:50
  • 2
    If you compare xml elements between two files - even less reason to treat xml as text, because two xml elements might have different text representation (like self-closing tag vs open-close tag) while having identical content. – Evk Nov 01 '17 at 07:53
  • 1
    And to add to Evk's examples, *semantically* `` and `` are the same also. – Damien_The_Unbeliever Nov 01 '17 at 08:01
  • ['Linq-to-XML'](https://blogs.msdn.microsoft.com/xmlteam/2007/03/05/streaming-with-linq-to-xml-part-1/) will let you stream the data, which might help. Quote: *... the target audience for LINQ to XML will sometimes encounter large documents or arbitrary streams of XML; they want the ease of use that LINQ to XML offers, but they don't want to have to load an entire data source into an in-memory tree before starting to work with it.*. More details [here](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/linq-to-xml-overview). – Matthew Watson Nov 01 '17 at 08:18
  • To process large XML files you can use XPath (e.g. XPathDocument, XPathNavigator, etc.). Also take a look at XML Diff. – Greg Nov 01 '17 at 08:20
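
Several of the comments above point at the same idea: stream the XML with XmlReader (optionally combined with LINQ to XML via XNode.ReadFrom) so that only one element at a time is materialized, and match elements between the two files by their unique ID. Below is a minimal sketch of that pattern, assuming a repeated element named "record", an "id" attribute, and placeholder file names; none of these names come from the question.

  using System;
  using System.Collections.Generic;
  using System.Xml;
  using System.Xml.Linq;

  static class XmlStreaming
  {
      // Yields matching elements one at a time; the whole document is never loaded.
      public static IEnumerable<XElement> StreamElements(string path, string elementName)
      {
          using (XmlReader reader = XmlReader.Create(path))
          {
              reader.MoveToContent();
              while (!reader.EOF)
              {
                  if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                  {
                      // ReadFrom materializes just this subtree and leaves the reader
                      // positioned after it, ready for the next sibling.
                      if (XNode.ReadFrom(reader) is XElement element)
                          yield return element;
                  }
                  else
                  {
                      reader.Read();
                  }
              }
          }
      }

      static void Main()
      {
          // Index the first file by its (assumed) "id" attribute...
          var firstById = new Dictionary<string, XElement>();
          foreach (var el in StreamElements("first.xml", "record"))
          {
              string id = (string)el.Attribute("id");
              if (id != null) firstById[id] = el;
          }

          // ...then stream the second file and compare matching elements.
          foreach (var el in StreamElements("second.xml", "record"))
          {
              string id = (string)el.Attribute("id");
              if (id != null && firstById.TryGetValue(id, out var match))
              {
                  bool identical = XNode.DeepEquals(el, match);
                  Console.WriteLine(id + ": " + (identical ? "same" : "different"));
              }
          }
      }
  }

Note that indexing the first file in a dictionary still keeps all of its matched elements in memory, so this only helps if the individual elements are small; for two 1+ GB files a two-pass or sort-and-merge approach may be needed instead.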

1 Answer

6

If you can do it line by line, then instead of saying "read everything into memory" with File.ReadAllText, you can say "yield me one line at a time" with File.ReadLines.

This returns an IEnumerable<string> that uses deferred execution. You can do it like this:

// Stream the input and write the transformed output as you go,
// so only one line is held in memory at a time.
using (StreamWriter sw = new StreamWriter(newFilePath))
{
    foreach (var line in File.ReadLines(path))
    {
        sw.WriteLine(line.Replace("<", "\r\n<"));
    }
}

If you want to learn more about deferred execution, you can check this github page.
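
A minimal sketch of what that deferred execution buys you: File.ReadLines composes with LINQ operators, and nothing is read from disk until the result is enumerated in the foreach. The file names and the trivial Length filter here are placeholders, not anything from the original answer.

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        // Nothing has been read from disk yet; this only describes the work.
        var lines = File.ReadLines("huge.xml")                     // placeholder path
                        .Select(line => line.Replace("<", "\r\n<"))
                        .Where(line => line.Length > 0);           // placeholder filter

        using (var writer = new StreamWriter("filtered.txt"))      // placeholder path
        {
            // Enumeration happens here, pulling one line at a time.
            foreach (var line in lines)
                writer.WriteLine(line);
        }
    }
}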

titol