
What my program basically does is search through XML files and return the filenames of those that have specific values in an element.

I guess I have to show you my xml first before I can continue:

 <DocumentElement>
   <Protocol>
     <DateTime>10.03.2003</DateTime>
     <Item>Date</Item>
     <Value />
   </Protocol>
   <Protocol>
     <DateTime>05.11.2020</DateTime>
     <Item>Status</Item>
     <Value>Ok</Value>
   </Protocol>
 </DocumentElement>

I have a few thousand XML files which have this exact layout. The user can get a list of all the matching files with the following method:

public List<string> GetFiles(string itemValue, string element, string value)
{
    return compatibleFiles.Where(path => XmlHasValue(path, itemValue, element, value)).ToList();
}

And this method returns whether the XML has the wanted value or not:

private bool XmlHasValue(string filePath, string itemValue, string element, string value)
{
    try
    {
        string foundValue = XDocument.Load(filePath)
            .Descendants()
            .Where(el => el.Name == "Item" && el.Value == itemValue)
            .First()
            .Parent
            .Descendants()
            .Where(des => des.Name == element && des.Value == value)
            .First()
            .Value;
        return foundValue == value;
    }
    catch (Exception)
    {
        return false;
    }
}

compatibleFiles is a list of all the paths to XML files that have the correct layout/format (XML shown above). The user provides GetFiles with the following:

  • itemValue -> the value the 'Item' element should have, "Status" for example
  • element -> the name of the element they want to check (in the same 'Protocol' element), e.g. "Value" or "DateTime"
  • value -> the value of that element, "Ok" in our example

The problem is that these methods take a long time to complete, and I'm almost certain there's a better and faster way to do what I want. I don't know if GetFiles can get any faster, but XmlHasValue surely can. Here are some test results:

(Screenshot of timing results omitted.)

Do you guys know any faster way to do this? It would be really helpful.

UPDATE

Turns out it was all just the I/O thread. If you have the same problem and think your code is bad, you should first check whether it's just one thread using all the CPU power.

baltermia
    Use [profiler](https://stackoverflow.com/q/3927/1997232) first. If 99% of time is IO and you already have SSD then there is nothing to improve. Unless you really want 19456 ms instead of 19546 ms. – Sinatr Nov 05 '20 at 13:47
  • did you try to use xpath queries? Example: `XmlNodeList xnList = xml.SelectNodes("/Names/Name");` – Rumplin Nov 05 '20 at 14:00
  • Hi speyck. How is this question different than the one you asked yesterday? ==> [Absolutely fastest way to go through a xml?](https://stackoverflow.com/questions/64679492/absolutely-fastest-way-to-go-through-a-xml) – Theodor Zoulias Nov 05 '20 at 16:12
  • If you are interested exclusively in optimizing the XML-parsing performance, you shouldn't load the XML directly from the filesystem with `XDocument.Load`. First read the text of the file with `File.ReadAllText` into a string variable, and then use the [`XDocument.Parse`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xdocument.parse) method to parse the string. This way you'll be able to get accurate measurements for the part you are interested in. – Theodor Zoulias Nov 05 '20 at 16:19
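To illustrate the last comment, here is a rough sketch (the class and method names are mine, not from the question) that separates the time spent reading the file from the time spent parsing it:

```csharp
using System.Diagnostics;
using System.IO;
using System.Xml.Linq;

static class TimedParse
{
    // Measures file reading and XML parsing separately,
    // so you can see which of the two dominates.
    public static (long readMs, long parseMs, XDocument doc) Load(string path)
    {
        var sw = Stopwatch.StartNew();
        string text = File.ReadAllText(path);   // I/O only
        long readMs = sw.ElapsedMilliseconds;

        sw.Restart();
        XDocument doc = XDocument.Parse(text);  // parsing only
        return (readMs, sw.ElapsedMilliseconds, doc);
    }
}
```

If the read time dominates, no amount of parser tuning will help.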

1 Answer


As @Sinatr mentions, profiling should always be the first step when investigating performance.

A reasonable guess at what takes the time would be:

  1. IO
  2. Parsing

IO could be improved by getting a faster disk, or by caching results in RAM. The latter may greatly improve performance if multiple searches are done, but it introduces issues like cache invalidation.
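One simple form of caching (a sketch; the class is mine, not from the question) is to memoize the parsed documents keyed by path:

```csharp
using System.Collections.Concurrent;
using System.Xml.Linq;

static class DocumentCache
{
    private static readonly ConcurrentDictionary<string, XDocument> cache = new();

    // Parses each file at most once. Note this never invalidates entries,
    // so it is only safe if the files do not change while the program runs.
    public static XDocument Load(string path)
        => cache.GetOrAdd(path, p => XDocument.Load(p));
}
```

Repeated searches over the same few thousand files would then only pay the parse cost once.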

According to "What is the best way to parse (big) XML in C# Code", XmlReader is the fastest way to parse XML. This blog suggests XmlReader is about 2.5 times faster.
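As a rough illustration (a sketch, not the poster's code; the element names are taken from the XML in the question), a streaming check with XmlReader could look like this:

```csharp
using System.Xml;

static class XmlScanner
{
    // Streaming equivalent of XmlHasValue: returns true if any <Protocol>
    // record whose <Item> text equals itemValue also has an element named
    // 'element' whose text equals 'value'.
    public static bool HasValue(string filePath, string itemValue, string element, string value)
    {
        using var reader = XmlReader.Create(filePath);
        string lastElement = null, currentItem = null, currentTarget = null;

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    lastElement = reader.Name;
                    if (reader.Name == "Protocol")
                        currentItem = currentTarget = null; // new record starts
                    break;
                case XmlNodeType.Text:
                    if (lastElement == "Item") currentItem = reader.Value;
                    else if (lastElement == element) currentTarget = reader.Value;
                    break;
                case XmlNodeType.EndElement:
                    if (reader.Name == "Protocol" &&
                        currentItem == itemValue && currentTarget == value)
                        return true; // stop as soon as a match is found
                    break;
            }
        }
        return false;
    }
}
```

Unlike XDocument.Load, this never builds the full tree in memory and can stop at the first matching Protocol.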

If you have multiple files you could also try to process them in parallel. Keep in mind that IO is mostly serial, so you might not gain anything unless you have an SSD that can deliver data faster than the files can be processed.
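With PLINQ the parallel version is nearly a one-line change. A generic sketch (the helper below is mine; in the question's code, `predicate` would be the `XmlHasValue` check and `paths` would be `compatibleFiles`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ParallelSearch
{
    // Evaluates the predicate for each path on the thread pool.
    // Result order is not preserved unless AsOrdered() is added.
    public static List<string> Filter(IEnumerable<string> paths, Func<string, bool> predicate)
        => paths.AsParallel().Where(predicate).ToList();
}
```

Whether this helps in practice depends on whether the per-file work is CPU-bound (parsing) or IO-bound (reading from disk).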

JonasH