9

I want to read a large XML file (100+ MB). Due to its size, I do not want to load the whole file into memory as an XElement. I am using LINQ to XML queries to parse and read it.

What's the best way to do it? Is there an example that combines XPath or XmlReader with LINQ to XML/XElement?

Please help. Thanks.

Jon Seigel
hIpPy

3 Answers

9

Yes, you can combine XmlReader with the XNode.ReadFrom method. See the example in the documentation, which uses C# to selectively process nodes found by the XmlReader as XElement objects.

Martin Honnen
  • brilliant. i'm developing an app that will be processing multiple 200M XML files and XDocument was killing me. this has made a huge improvement. thanks. – Mike Jacobs Apr 21 '10 at 19:57
  • I think there's a bug in the example code on the `XNode.ReadFrom` documentation page. The statement `XElement el = XElement.ReadFrom(reader) as XElement;` should be `XElement el = new XElement(reader.Name, reader.Value);` instead. As-is, the first of every two 'Child' elements is skipped in the XML file from which it reads. – Kenny Evitt Aug 16 '13 at 19:54
  • See my answer for code that works for me; see [this answer](http://stackoverflow.com/a/2299683/173497) by [Jon Skeet](http://stackoverflow.com/users/22656/jon-skeet) for an explanation of why the two 'read' methods shouldn't be mixed. [Jon's answer doesn't mention the `XNode.ReadFrom` method explicitly, but I'm confident that the same issue applies.] – Kenny Evitt Aug 16 '13 at 21:01
7

The example code in the MSDN documentation for the XNode.ReadFrom method is as follows:

class Program
{
    static IEnumerable<XElement> StreamRootChildDoc(string uri)
    {
        using (XmlReader reader = XmlReader.Create(uri))
        {
            reader.MoveToContent();
            // Parse the file and display each of the nodes.
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == "Child")
                        {
                            XElement el = XElement.ReadFrom(reader) as XElement;
                            if (el != null)
                                yield return el;
                        }
                        break;
                }
            }
        }
    }

    static void Main(string[] args)
    {
        IEnumerable<string> grandChildData =
            from el in StreamRootChildDoc("Source.xml")
            where (int)el.Attribute("Key") > 1
            select (string)el.Element("GrandChild");

        foreach (string str in grandChildData)
            Console.WriteLine(str);
    }
}

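But I've found that this example can skip elements. XNode.ReadFrom consumes the whole element and leaves the reader positioned on the node that follows it, so the unconditional reader.Read() in the loop condition then steps past that node; when Child elements are adjacent, every second one is lost. The StreamRootChildDoc method needs to be modified so that Read is called only when ReadFrom was not: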

    static IEnumerable<XElement> StreamRootChildDoc(string uri)
    {
        using (XmlReader reader = XmlReader.Create(uri))
        {
            reader.MoveToContent();
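            // Yield each Child element as an XElement, one at a time.
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    // ReadFrom consumes the element and leaves the reader on the
                    // node that follows it, so do not call Read() on this path.
                    XElement el = XElement.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    // Only advance manually when ReadFrom did not.
                    reader.Read();
                }
            }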
        }
    }
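
To see the difference between the two loops, here is a minimal sketch. It assumes a small in-memory document whose Child elements are adjacent with no whitespace between them; the sample XML and the names StreamDemo, StreamBuggy, StreamFixed and Keys are made up for illustration and are not part of the documentation example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamDemo
{
    // Hypothetical sample: four adjacent Child elements, no whitespace between them.
    public const string Xml =
        "<Root>" +
        "<Child Key=\"1\"><GrandChild>a</GrandChild></Child>" +
        "<Child Key=\"2\"><GrandChild>b</GrandChild></Child>" +
        "<Child Key=\"3\"><GrandChild>c</GrandChild></Child>" +
        "<Child Key=\"4\"><GrandChild>d</GrandChild></Child>" +
        "</Root>";

    // Loop shape of the documentation example: Read() runs on every iteration,
    // including right after ReadFrom has already advanced the reader.
    public static IEnumerable<XElement> StreamBuggy(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
            }
        }
    }

    // Loop shape of the modified method: Read() runs only when ReadFrom did not.
    public static IEnumerable<XElement> StreamFixed(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Joins the Key attribute of each streamed element into one string.
    public static string Keys(IEnumerable<XElement> elements)
    {
        return string.Join(",", elements.Select(e => (string)e.Attribute("Key")));
    }

    public static void Main()
    {
        Console.WriteLine("documentation loop: " + Keys(StreamBuggy(new StringReader(Xml)))); // 1,3
        Console.WriteLine("modified loop:      " + Keys(StreamFixed(new StringReader(Xml)))); // 1,2,3,4
    }
}
```

With indented XML, the node following each Child is a whitespace node, so the extra Read() lands on whitespace rather than on the next Child. That is why the documentation loop can appear to work on pretty-printed files.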
Kenny Evitt
  • Yes. The first example doesn't work. It will read too much and skip every other "Child". – GHZ Feb 03 '20 at 09:33
1

Just keep in mind that you will have to read the file sequentially, so referring to siblings or descendants is going to be slow at best and impossible at worst. Otherwise, @MartinHonnen has the key.
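
One way to work within that constraint, sketched under assumptions: an element produced by XNode.ReadFrom is a detached subtree, so its own descendants can be queried, but it has no parent and therefore no sibling links. Anything you need from an earlier element has to be carried forward yourself. The sample XML and the names SiblingDemo, StreamChildren and OutOfOrder below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class SiblingDemo
{
    // Hypothetical sample in which the third Child has a smaller Key than the second.
    public const string Xml =
        "<Root>" +
        "<Child Key=\"1\"><GrandChild>a</GrandChild></Child>" +
        "<Child Key=\"3\"><GrandChild>b</GrandChild></Child>" +
        "<Child Key=\"2\"><GrandChild>c</GrandChild></Child>" +
        "<Child Key=\"4\"><GrandChild>d</GrandChild></Child>" +
        "</Root>";

    // Streams each Child element; only one subtree is materialized at a time.
    public static IEnumerable<XElement> StreamChildren(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Returns the GrandChild text of every Child whose Key is lower than the
    // Key of the Child streamed just before it.
    public static List<string> OutOfOrder(string xml)
    {
        var result = new List<string>();
        XElement previous = null; // state carried forward by hand
        foreach (XElement current in StreamChildren(new StringReader(xml)))
        {
            // current.PreviousNode is null here because current has no parent,
            // so the preceding sibling must come from our own variable.
            if (previous != null &&
                (int)current.Attribute("Key") < (int)previous.Attribute("Key"))
            {
                // Descendants of the streamed element are available as usual.
                result.Add((string)current.Element("GrandChild"));
            }
            previous = current;
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", OutOfOrder(Xml))); // c
    }
}
```

This keeps memory bounded by two Child subtrees at a time. Queries that need arbitrary look-ahead or look-behind would still require buffering or a second pass.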

No Refunds No Returns