I have a program that goes through thousands of files and has to check whether each one has the correct XML format. The problem is that it takes ages to complete, and I think that's because of the type of XML reader I use.

The method below contains three different versions that I tried. The first one is the fastest, but only by 5%. (The method does not need to check whether the file is an XML file.)

private bool HasCorrectXmlFormat(string filePath)
{
    try
    {
        //-Version 1----------------------------------------------------------------------------------------
        XmlReader reader = XmlReader.Create(filePath, new XmlReaderSettings() { IgnoreComments = true, IgnoreWhitespace = true });

        string[] elementNames = new string[] { "DocumentElement", "Protocol", "DateTime", "Item", "Value" };

        int i = 0;

        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name != elementNames.ElementAt(i))
                {
                    return false;
                }

                if (i >= 4)
                {
                    return true;
                }

                i++;
            }

        }

        return false;
        //--------------------------------------------------------------------------------------------------


        //-  Version 2  ------------------------------------------------------------------------------------
        IEnumerable<XElement> xmlElements = XDocument.Load(filePath).Descendants();

        string[] elementNames = new string[] { "DocumentElement", "Protocol", "DateTime", "Item", "Value" };

        for (int i = 0; i < 5; i++)
        {
            if (xmlElements.ElementAt(i).Name != elementNames.ElementAt(i))
            {
                return false;
            }
        }

        return true;
        //--------------------------------------------------------------------------------------------------


        //-  Version 3  ------------------------------------------------------------------------------------
        XDocument doc = XDocument.Load(filePath);

        if (doc.Root.Name != "DocumentElement")
        {
            return false;
        }

        if (doc.Root.Elements().First().Name != "Protocol")
        {
            return false;
        }

        if (doc.Root.Elements().First().Elements().ElementAt(0).Name != "DateTime")
        {
            return false;
        }

        if (doc.Root.Elements().First().Elements().ElementAt(1).Name != "Item")
        {
            return false;
        }

        if (doc.Root.Elements().First().Elements().ElementAt(2).Name != "Value")
        {
            return false;
        }

        return true;
        //--------------------------------------------------------------------------------------------------
    }
    catch (Exception)
    {
        return false;
    }
}

What I need is a faster way to do this. Is there a faster way to go through an XML file? I only have to check whether the first 5 elements have the correct names.

UPDATE

The XML files are only 2-5 KB in size, rarely more than that. The files are located on a local server. I am on a laptop with an SSD.

Here are some test results:

[Three screenshots of timing results; images not available]

I should also add that the files are filtered beforehand, so only XML files are passed to the method. I get the files with the following method:

public List<FileInfo> GetCompatibleFiles()
    {
        return new DirectoryInfo(folderPath)
                    .EnumerateFiles("*", searchOption)
                    .AsParallel()
                    .Where(file => file.Extension == ".xml" ? HasCorrectXmlFormat(file.FullName) : false)
                    .ToList();
    }

This method is not in my code like this (I put two methods together); it is only here to show how the HasCorrectXmlFormat method is called. You don't have to correct this method, I know it can be improved.

UPDATE 2

Here are the two full methods mentioned at the end of update 1:

private void WriteAllFilesInList()
    {
        allFiles = new DirectoryInfo(folderPath)
                    .EnumerateFiles("*", searchOption)
                    .AsParallel()
                    .ToList();
    }

private void WriteCompatibleFilesInList()
    {
        compatibleFiles = allFiles
                            .Where(file => file.Extension == ".xml" ? HasCorrectXmlFormat(file.FullName) : false)
                            .ToList();
    }

Both methods are only called once in the entire program (if either the allFiles or compatibleFiles List is null).

UPDATE 3

It seems like the WriteAllFilesInList Method is the real problem here, shown here:

[Screenshot of the measurement; image not available]

FINAL UPDATE

As it seems, my method doesn't need any improvement as the bottleneck is something else.
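
Following the suggestions in the comments, one possible restructuring of the two methods from update 2 is sketched below. It asks for `*.xml` files only instead of listing every file, and moves `AsParallel()` to the per-file check rather than the directory listing. The class name `CompatibleFileFinder` and the `isCompatible` parameter are hypothetical; the parameter stands in for `HasCorrectXmlFormat`. Whether this actually helps depends on where the time is spent, so it would need to be measured.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CompatibleFileFinder
{
    // "isCompatible" is a placeholder for HasCorrectXmlFormat from the question.
    public static List<FileInfo> GetCompatibleFiles(
        string folderPath, SearchOption searchOption, Func<string, bool> isCompatible)
    {
        return new DirectoryInfo(folderPath)
            // Ask only for XML files instead of enumerating every file.
            .EnumerateFiles("*.xml", searchOption)
            // On some runtimes the pattern also matches longer extensions such as
            // ".xmlx", so the exact extension check from the question is kept.
            .Where(file => file.Extension == ".xml")
            // Parallelize the per-file work (open and parse), not the listing itself.
            .AsParallel()
            .Where(file => isCompatible(file.FullName))
            .ToList();
    }
}
```

A call could look like `CompatibleFileFinder.GetCompatibleFiles(folderPath, searchOption, HasCorrectXmlFormat)`.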

baltermia
  • 1
    Write a schema file that does the validation you need and use a premade schema validator? – AKX Nov 04 '20 at 11:50
  • XML Serialization is very, very slow!!! I usually use Xml Linq which is fast. Never did a comparison between XmlReader and Xml Linq. Sometimes with huge files I use both together. – jdweng Nov 04 '20 at 11:53
  • 3
    You won't easily get something faster than an XmlReader. Did you check if the processing time is determined by I/O or by CPU? In the latter case, you might try to check multiple files in parallel. – Klaus Gütter Nov 04 '20 at 11:54
  • 2
    @jdweng Where does the OP use XML Serialization? – Klaus Gütter Nov 04 '20 at 11:55
  • 1
    Have you validated that the outer loop is correct? Maybe you are doing more work than you think you are doing, e.g. reading the same file more than once? – Rand Random Nov 04 '20 at 12:06
  • @RandRandom : Using a schema. – jdweng Nov 04 '20 at 12:11
  • @jdweng - don't really get how that would be a comment related to what I said?!? – Rand Random Nov 04 '20 at 12:12
  • 1
    Could you share the performance measurements of Version 1, Version 2 and Version 3? Also what is the average size of the XML files you are validating, what is the type of hardware storage you use (hard disk drive? SSD?), and where is it located (local machine, local network etc)? – Theodor Zoulias Nov 04 '20 at 12:28
  • 1
    The run time seems to scale more with the total number of files, not with the number of XML files. Is it really HasCorrectXmlFormat which is the bottleneck? – Klaus Gütter Nov 04 '20 at 12:53
  • 1
    Could you do an experiment? Measure the performance using this dummy `HasCorrectXmlFormat` version: `private bool HasCorrectXmlFormat(string filePath) => File.ReadAllText(filePath).Length > 1000;`. My guess is that the bottleneck of your program is not the XML parsing, but the network latency. If my guess is correct, then the dummy `HasCorrectXmlFormat` (that does no parsing at all) should perform more or less equal with the versions you have already tried. – Theodor Zoulias Nov 04 '20 at 12:54
  • That is a good point you're making @KlausGütter but I already improved the method to get the files as far as I can. Or maybe you know any better way? The way I get the files is shown in the update. Note that the method is not exactly like that in my code, it's two methods put together. – baltermia Nov 04 '20 at 12:59
  • Also, since you are already using PLINQ, you could experiment with various configurations regarding the degree of parallelism (`WithDegreeOfParallelism(1)`, `WithDegreeOfParallelism(2)`, `WithDegreeOfParallelism(4)` etc), and report the results. – Theodor Zoulias Nov 04 '20 at 13:05
  • @TheodorZoulias I guess you're right. Here's the result: _374, 27415, 89730_, compared to my V1: _378, 23876, 85513_. So what now? lol – baltermia Nov 04 '20 at 13:07
  • Hmmm... Is it an option to run your program directly on the local server where the files are located? This would reduce the latency greatly. – Theodor Zoulias Nov 04 '20 at 13:09
  • @TheodorZoulias you think the latency is the problem? Look at my update 2. Should I call `WriteAllFilesInList` and then look how long `WriteCompatibleFilesInList` takes? Any ways to improve these two methods? I already tried to improve them as much as I can but maybe you know a better way to do that. – baltermia Nov 04 '20 at 13:13
  • 1
    So it's clear now that it makes no sense to concentrate on the XML parsing part, as this does not contribute much to the total run time. But improving the rest should IMHO be asked in a new question. – Klaus Gütter Nov 04 '20 at 13:16
  • speyck yes, it will definitely help to make more detailed measurements. Probably you'll get better performance by filtering initially `allXmlFiles` instead of `allFiles`. Also the `AsParallel()` should be more beneficial for the second query than the first. You have a lot of measurements to do! – Theodor Zoulias Nov 04 '20 at 13:19
  • 1
    @TheodorZoulias I will do that. Thank you very much for the help! – baltermia Nov 04 '20 at 13:23
  • @KlausGütter I will create a new question. Thank you for the help, I appreciate it. – baltermia Nov 04 '20 at 13:41
  • As in your [other question](https://stackoverflow.com/q/64681460/3744182), you are not disposing of your `XmlReader reader`, e.g. by doing `using var reader = XmlReader.Create(filePath, new XmlReaderSettings() { IgnoreComments = true, IgnoreWhitespace = true });`. That could impact performance as well as leak resources as you might have thousands of files open at once until / unless the finalizer kicks in and closes them. Assuming you do that `XmlReader` should be faster since you don't process the entire file, while both `XDocument` and `XmlDocument` use it to load the entire file. – dbc Nov 04 '20 at 17:37
  • @dbc I go through each file later in the program and search for specific values. I've never used using before so I dont know what it does. Should I still use it if I go through the file later again? Thanks for the help btw – baltermia Nov 05 '20 at 09:25
  • @speyck - The `using` statement is a c# language element, see the documentation [here](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/using-statement). For why to use it, see [What is the C# Using block and why should I use it?](https://stackoverflow.com/q/212198/3744182) and [What are the uses of “using” in C#?](https://stackoverflow.com/q/75401/3744182). In your code the `XmlReader reader` goes out of scope once execution leaves the `try` statement, so you aren't making further use of it. Thus, it should be disposed, e.g. by a `using` statement. – dbc Nov 05 '20 at 14:42
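
The disposal point raised in the last comments can be sketched as follows. This is a minimal variant of Version 1 from the question, assuming the same five element names; the class name `XmlFormatCheck` is hypothetical. The only substantive changes are the `using` statement and indexing the array directly instead of calling `ElementAt`.

```csharp
using System;
using System.Xml;

public static class XmlFormatCheck
{
    // Expected names of the first five elements, in document order (from the question).
    private static readonly string[] ExpectedNames =
        { "DocumentElement", "Protocol", "DateTime", "Item", "Value" };

    public static bool HasCorrectXmlFormat(string filePath)
    {
        try
        {
            var settings = new XmlReaderSettings { IgnoreComments = true, IgnoreWhitespace = true };

            // "using" disposes the reader and closes the file on every exit path,
            // including the early returns inside the loop.
            using (XmlReader reader = XmlReader.Create(filePath, settings))
            {
                int i = 0;
                while (reader.Read())
                {
                    if (reader.NodeType != XmlNodeType.Element)
                        continue;

                    if (reader.Name != ExpectedNames[i])
                        return false;

                    // Stop as soon as the fifth element has matched.
                    if (++i == ExpectedNames.Length)
                        return true;
                }
            }

            return false; // fewer than five elements in the file
        }
        catch (Exception)
        {
            // As in the question, any failure (malformed XML, unreadable file) counts as "wrong format".
            return false;
        }
    }
}
```

Because the reader is closed when the method returns, opening the same file again later in the program is unaffected.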

2 Answers

0

I would write the code like this using Xml Linq, which is a little faster than your code. Your code is looping through the XML file multiple times, while mine goes through the file only once.

    private bool HasCorrectXmlFormat(string filePath)
    {
        try
        {
            // Load the document once and keep references to the elements we navigate.
            XDocument doc = XDocument.Load(filePath);
            XElement root = doc.Root;
            if (root.Name != "DocumentElement")
            {
                return false;
            }

            XElement protocol = root.Elements().First();
            if (protocol.Name != "Protocol")
            {
                return false;
            }

            // The first three children of Protocol must be DateTime, Item and Value.
            XElement dateTime = protocol.Elements().First();
            if (dateTime.Name != "DateTime")
            {
                return false;
            }
            XElement item = protocol.Elements().Skip(1).First();
            if (item.Name != "Item")
            {
                return false;
            }
            XElement value = protocol.Elements().Skip(2).First();
            if (value.Name != "Value")
            {
                return false;
            }
        }
        catch (Exception)
        {
            // Missing elements or malformed XML end up here.
            return false;
        }
        return true;
    }
jdweng
  • 1
    This may be a little bit faster than Version 3, but cannot beat XmlReader; see also https://stackoverflow.com/questions/2735434/performance-xmlreader-or-linq-to-xml – Klaus Gütter Nov 04 '20 at 12:15
  • Your linq is going through the file 4 times.. Mine only once. So your 15 msec with my code will be about the same as the XmlReader. – jdweng Nov 04 '20 at 12:28
  • @jdweng This is almost exactly as fast as my V2 (which is the slowest of my three versions). I appreciate the effort tho, thank you. – baltermia Nov 04 '20 at 12:54
  • How large is the Xml File? The XmlReader is reading dynamically while Xml Linq is read entire file. – jdweng Nov 04 '20 at 13:02
  • @jdweng only a few KBs (5KB max). But I found out that the bottleneck was actually the function that gets the files. – baltermia Nov 05 '20 at 07:32
0

Here is an example that reads a sample XML file and compares LINQ to XML, XmlReader and XmlDocument.

In this test, LINQ was the fastest.

Sample Code

using System;
using System.Diagnostics;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

namespace ReadXMLInCsharp
{
  class Program
  {
    static void Main(string[] args)
    {
     
        //returns url of main directory which contains "/bin/Debug"
        var url=System.IO.Path.GetDirectoryName(
System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase);
        
        //correction in path to point it in Root directory
        var mainpath = url.Replace("\\bin\\Debug", "") + "\\books.xml";

        var stopwatch = new Stopwatch();
        stopwatch.Start();

        //create XMLDocument object
        XmlDocument xmlDoc = new XmlDocument();
        //load xml file
        xmlDoc.Load(mainpath);
        //save all nodes in XMLnodelist
        XmlNodeList nodeList = xmlDoc.DocumentElement.SelectNodes("/catalog/book");

        //loop through each node and save it value in NodeStr
        var NodeStr = "";

        foreach (XmlNode node in nodeList)
        {
            NodeStr = NodeStr + "\nAuthor " + node.SelectSingleNode("author").InnerText;
            NodeStr = NodeStr + "\nTitle " + node.SelectSingleNode("title").InnerText;
            NodeStr = NodeStr + "\nGenre " + node.SelectSingleNode("genre").InnerText;
            NodeStr = NodeStr + "\nPrice " + node.SelectSingleNode("price").InnerText;
            NodeStr = NodeStr + "\nDescription -" + node.SelectSingleNode("description").InnerText;


        }
        //print all Authors details
        Console.WriteLine(NodeStr);
        stopwatch.Stop();
        Console.WriteLine();
        Console.WriteLine("Time elapsed using XmlDocument (ms)= " + stopwatch.ElapsedMilliseconds);
        Console.WriteLine();

        stopwatch.Reset();

        stopwatch.Start();
        NodeStr = "";
        //linq method
        //get all elements inside book
        foreach (XElement level1Element in XElement.Load(mainpath).Elements("book"))
        {
            //print each element value
            //you can also print XML attribute value, instead of .Element use .Attribute
            NodeStr = NodeStr + "\nAuthor " + level1Element.Element("author").Value;
            NodeStr = NodeStr + "\nTitle " + level1Element.Element("title").Value;
            NodeStr = NodeStr + "\nGenre " + level1Element.Element("genre").Value;
            NodeStr = NodeStr + "\nPrice " + level1Element.Element("price").Value;
            NodeStr = NodeStr + "\nDescription -" + level1Element.Element("description").Value;
        }

        //print all Authors details
        Console.WriteLine(NodeStr);
        stopwatch.Stop();
        Console.WriteLine();
        Console.WriteLine("Time elapsed using linq(ms)= " + stopwatch.ElapsedMilliseconds);
        Console.WriteLine();

        stopwatch.Reset();
        stopwatch.Start();
        //method 3
        //XMLReader
        XmlReader xReader = XmlReader.Create(mainpath);

        xReader.ReadToFollowing("book");
        NodeStr = "";
        while (xReader.Read())
        {
            switch (xReader.NodeType)
            {
                case XmlNodeType.Element:
                    NodeStr = NodeStr + "\nElement name:" + xReader.Name;
                    break;
                case XmlNodeType.Text:
                    NodeStr = NodeStr + "\nElement value:" + xReader.Value;
                    break;
                case XmlNodeType.None:
                    //do nothing
                    break;

            }
        }

        //print all Authors details
        Console.WriteLine(NodeStr);
        stopwatch.Stop();
        Console.WriteLine();
        Console.WriteLine("Time elapsed using XMLReader (ms)= " + stopwatch.ElapsedMilliseconds);
        Console.WriteLine();
        stopwatch.Reset();


        Console.ReadKey();
    }
  }
}

Output:

-- First Run
Time elapsed using XmlDocument (ms)= 15

Time elapsed using linq(ms)= 7

Time elapsed using XMLReader (ms)= 12

-- Second Run
Time elapsed using XmlDocument (ms)= 18

Time elapsed using linq(ms)= 3

Time elapsed using XMLReader (ms)= 15

I have removed some output to show only comparison data.

Source: Open and Read XML in C# (Examples using Linq, XMLReader, XMLDocument)

Edit: If I comment out 'Console.WriteLine(NodeStr)' in all three methods and print only the time comparison, this is what I get:

Time elapsed using XmlDocument (ms)= 11


Time elapsed using linq(ms)= 0


Time elapsed using XMLReader (ms)= 0

Basically, it depends on how you process the data and how you read the XML. The LINQ and XmlReader approaches look more promising in terms of speed.

Theodor Zoulias
Vikas Lalwani
  • 3
    The test data is ridiculously small. And your results without the Console.WriteLine show that the measurement method is not appropriate. – Klaus Gütter Nov 04 '20 at 12:35
  • @KlausGütter yes test data is small, but it is just for sample purposes, as I understand, we can take big XML and do same thing to check output. About Without Console.WriteLine, again this is just sample test, which I performed and mentioned everything here. – Vikas Lalwani Nov 04 '20 at 12:38
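
The measurement concern raised in the comments above could be addressed along these lines. This is only a sketch, not the answer's method: it runs each candidate once as a warm-up (so JIT compilation is not counted), then averages over many iterations, and works on an in-memory string so that disk or network I/O does not dominate. All names here (`XmlTiming`, `AverageMilliseconds`, the two counting methods) are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlTiming
{
    // Runs "action" once untimed as a warm-up, then returns the average
    // time per call in milliseconds over "iterations" timed calls.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action();
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }

    // Streams through the XML with XmlReader and counts the elements.
    public static int CountElementsWithReader(string xml)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    count++;
                }
            }
        }
        return count;
    }

    // Builds the full LINQ to XML tree and counts the elements.
    public static int CountElementsWithLinq(string xml)
    {
        return XDocument.Parse(xml).Descendants().Count();
    }
}
```

For example, `XmlTiming.AverageMilliseconds(() => XmlTiming.CountElementsWithReader(xml), 10000)` can be compared with the same call using `CountElementsWithLinq`, where `xml` holds the contents of a representative file. A dedicated benchmarking library would give more reliable numbers than a hand-rolled loop.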