How to parse large XML with .net without loading any node into memory?

Question

I want to parse large (4 GB to 5 GB) XML files with .NET 4. Because of memory constraints I am using XmlReader, and I am using XPath to check the extraction condition. But to evaluate the XPath, I need to load the specific element node into memory using XElement or XDocument.

My concern is that if that element node has a lot of descendants, the program will run into an out-of-memory exception. So, is there any way to avoid loading any element node into memory while parsing and checking the condition? Any suggestions?

Thanks in advance

When you say "node", do you mean an element node, that would have a lot of descendants? Or do you mean a text node, with a lot of text content? The first could be done incrementally with several tools; the second is a different animal. — LarsH, Aug 26 '15 at 14:23
Yes, I meant an element node that has a lot of descendants. I have edited my question. Would you please tell me about those tools? @LarsH — Akib, Aug 27 '15 at 04:39
This has been marked as a duplicate question, with a link to earlier answers (http://stackoverflow.com/questions/15772031/how-to-parse-very-huge-xml-files-in-c). Please follow that link to learn about those tools. The main answer is XmlReader. — LarsH, Aug 31 '15 at 22:34
As I said in my question, I am using XmlReader, but to extract the elements that satisfy a certain condition, I need to load each element into memory to check whether the condition holds. I want to avoid this loading when validating the extraction condition. This (http://stackoverflow.com/questions/15772031/how-to-parse-very-huge-xml-files-in-c) does not answer my question. Thank you. @LarsH — Akib, Sep 03 '15 at 04:49
Sorry, Akib ... I overlooked that, and it looks like the others who voted to "close as duplicate" did too. This should be reopened. — LarsH, Sep 03 '15 at 14:39
To answer your question, I think if you're going to use XPath, there's no way to avoid loading the whole element's subtree into memory. Since XPath could access anything in the document, how would the API know how much of the subtree to load? (Disclaimer: my specialization is in XPath, not in the .NET APIs). I think if you're going to try to avoid loading the whole document into memory, you'll need to use tools other than XPath to check the extraction condition. If you explain to us the condition, maybe we can help you see how to check it without using XPath or loading the element. — LarsH, Sep 03 '15 at 14:42
P.S. I'm not sure what you meant by "overlook the loading of element node into memory while parsing with condition checking". — LarsH, Sep 03 '15 at 14:42
Thank you for your kind response. I know that there is no way to avoid loading the subtree when using XPath. What would be the best choice other than XPath? By "overlook the loading of element node into memory while parsing with condition checking", I meant avoiding loading the subtree into memory. @LarsH — Akib, Sep 04 '15 at 15:46
If you explain to us the sort of condition you need to check, maybe we can help you see how to check it without using XPath. Are you looking for something nearby, like the presence / values of certain attributes on the element itself? Or do you need to check conditions on aunts & uncles, following & descendants, etc.? — LarsH, Sep 04 '15 at 19:02
I need to check for the presence of certain attributes on the element itself and also on its descendants. Actually, I am trying to make a generic XML parser for large files. @LarsH — Akib, Sep 07 '15 at 04:44
Given that you have to check some things on descendants also, the best way is probably to keep track of state yourself as you go through the document sequentially using XmlReader. So it sounds like this question may be a duplicate after all. Did you go through all the answers at http://stackoverflow.com/questions/2441673/reading-xml-with-xmlreader-in-c-sharp ? — LarsH, Sep 07 '15 at 13:37
Yes, I probably have to keep track of state myself. I have gone through that question, but all of the answers there load an element node into memory, and that is what I want to avoid. @LarsH — Akib, Sep 08 '15 at 04:27
OK. It sounds like the part you still have questions about is how to keep track of state while reading sequentially through a large XML file, in such a way that when you have accumulated enough info to know whether an element/subtree fulfills the extraction condition, you will still be able to perform whatever processing you need to perform on that subtree without backing up. Right? And the answer will be very specific to (1) the condition and (2) what processing you need to perform. So in order to get a useful answer, you may need to post a new question that includes details on (1) and (2). — LarsH, Sep 08 '15 at 15:41
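
For illustration only, here is a minimal sketch of the "keep track of state yourself" approach discussed in the comments above. The actual condition was never posted, so everything specific here is an assumption: the element name `record`, the attribute `id`, the descendant name `flag`, and the class `StreamingMatcher` are all hypothetical. It reads forward with XmlReader and keeps only a few flags per open candidate element, so memory use does not grow with the size of the subtree.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class StreamingMatcher
{
    // Returns the zero-based ordinals (in document order) of outermost <record>
    // elements that carry an "id" attribute AND contain at least one <flag>
    // descendant. The names "record", "id" and "flag" are hypothetical.
    public static List<int> FindMatches(TextReader input)
    {
        var matches = new List<int>();
        int ordinal = -1;       // index of the most recently opened <record>
        int recordDepth = -1;   // depth of the currently open <record>, -1 if none
        bool hasId = false;     // state: condition on the element itself
        bool sawFlag = false;   // state: condition on its descendants

        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (recordDepth < 0 && reader.Name == "record")
                    {
                        ordinal++;
                        hasId = reader.GetAttribute("id") != null;
                        sawFlag = false;
                        // <record/> has no descendants and produces no EndElement
                        // node, so it can never match and is not tracked.
                        if (!reader.IsEmptyElement)
                            recordDepth = reader.Depth;
                    }
                    else if (recordDepth >= 0 && reader.Name == "flag")
                    {
                        sawFlag = true;
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement
                         && reader.Depth == recordDepth)
                {
                    // The closing tag of the tracked <record>: the condition is
                    // now fully decided without having built any subtree.
                    if (hasId && sawFlag)
                        matches.Add(ordinal);
                    recordDepth = -1;
                }
            }
        }
        return matches;
    }
}
```

Note the limitation raised in the last comment: the condition is only known at the closing tag, after the subtree has already streamed past. In this sketch the matching ordinals would have to drive a second forward pass over the file to do the actual extraction; whether that is acceptable depends on the condition and the processing required.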


0 Answers