My XML file is around 7 MB. I have to remove invalid characters from some of its nodes. There are many such nodes, like "title", "country" and so on.

I have 31,000 matches for the "title" node, and processing them takes more than 35 minutes, which is not acceptable for my project requirements. How can I optimise this?

Method call

  fileText = RemoveInvalidCharacters(fileText, "title", @"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", "$1");  

Method definition

private static string RemoveInvalidCharacters(string fileText, string nodeName, string regexPattern, string regexReplacement)
{
    foreach (Match match in Regex.Matches(fileText, @"<" + nodeName + ">(.*)</" + nodeName + ">"))
    {
        var oldValue = match.Groups[0].Value;
        var newValue = "<" + nodeName + ">" + Regex.Replace(match.Groups[1].Value, regexPattern, regexReplacement) +
                       "</" + nodeName + ">";
        fileText = fileText.Replace(oldValue, newValue);
    }

    return fileText;
}
Kuntady Nithesh
  • You are calling `RemoveSpecialCharacters` but the method name is `RemoveInvalidCharacters` – Tim Schmelter Dec 01 '15 at 10:13
  • You can optimize this by *not using regex for XML parsing*. Use proper tools, .NET has XML parsers and even low-level XML readers/writers if full-blown parsers aren't fast enough. *(I must resist the temptation to dupehammer this to [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags))*. – Lucas Trzesniewski Dec 01 '15 at 10:13
  • I think the problem can be with `(.*)` subpattern. What if you use `Regex.Matches(fileText, @"<" + nodeName + @">([^<]*(?:<(?!" + nodeName + @">)[^<]*)*)" + nodeName + ">")`? – Wiktor Stribiżew Dec 01 '15 at 10:14
  • sorry @tim corrected now – Kuntady Nithesh Dec 01 '15 at 10:30
  • @stribizhev `(.*)` is not the problem; it is fast. The looping is what takes the time. – Kuntady Nithesh Dec 01 '15 at 10:31
  • Definitely do use built-in XML tools as demonstrated by Chris Davis. Additionally, it might be possible to optimize your Regex to improve matching performance, if you tell us what exactly you want to achieve. – marsze Dec 01 '15 at 11:15

1 Answer


Instead of using Regex to parse the XML document, you can use the tools in the System.Xml.Linq namespace to handle the parsing for you, which is much faster and easier to use.

Here's an example program that builds a structure with 35,000 nodes. I've kept your regex to check for the bad characters, but I've constructed it with RegexOptions.Compiled, which should yield better performance, although admittedly the increase wasn't huge when I compared the two.

This example uses Descendants, which returns every element with the name you pass in that sits below the element you call it on (in this case, we start from the document root). Those results are then filtered by the ContainsBadCharacters method.

For the sake of simplicity I haven't made the foreach loops DRY, but it's probably worth doing so.

On my machine, this runs in less than a second, but timings will vary based on machine performance and occurrences of bad characters.

using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

namespace ConsoleApplication2
{
    class Program
    {
        static Regex r = new Regex(@"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", RegexOptions.Compiled);

        static void Main(string[] args)
        {
            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            var xmls = new StringBuilder("<Nodes>");
            for(int i = 0;i<35000;i++)
            {
                xmls.Append(@"<Node>
                                  <Title>Lorem~~~~</Title>
                                  <Country>Ipsum!</Country>
                               </Node>");
            }
            xmls.Append("</Nodes>");

            var doc = XDocument.Parse(xmls.ToString());

            sw.Start();
            foreach(var element in doc.Descendants("Title").Where(ContainsBadCharacters))
            {               
                element.Value = r.Replace(element.Value, "$1");
            }
            foreach (var element in doc.Descendants("Country").Where(ContainsBadCharacters))
            {
                element.Value = r.Replace(element.Value, "$1");
            }
            sw.Stop();

            // XDocument.Save creates (or overwrites) the file itself, so no explicit Create() is needed.
            // FileInfo.Create() returns an open FileStream, which would keep the file locked if left undisposed.
            var saveDir = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
            var savePath = Path.Combine(saveDir, "test.txt");

            doc.Save(savePath);
            Console.WriteLine(sw.Elapsed);
            Console.Read();
        }

        static bool ContainsBadCharacters(XElement item)
        {
            return r.IsMatch(item.Value);
        }
    }
}
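
As for making the two foreach loops DRY, one way is to fold them into a single helper that takes the element names as parameters. This is only a sketch: the class name XmlCleaner and the method CleanElements are made up for illustration, and it assumes the target elements are leaf elements that contain text only (assigning to Value replaces any child elements).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class XmlCleaner
{
    // Same pattern as in the program above, compiled once and reused.
    static readonly Regex BadChars =
        new Regex(@"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", RegexOptions.Compiled);

    // Hypothetical helper (not part of any library): runs the replacement on every
    // descendant element whose name is listed in elementNames.
    // Returns how many elements matched the pattern and were rewritten.
    public static int CleanElements(XDocument doc, params string[] elementNames)
    {
        int touched = 0;
        foreach (var name in elementNames)
        {
            // ToList() takes a snapshot of the matches before any Value is assigned.
            var matches = doc.Descendants(name)
                             .Where(e => BadChars.IsMatch(e.Value))
                             .ToList();

            foreach (var element in matches)
            {
                element.Value = BadChars.Replace(element.Value, "$1");
                touched++;
            }
        }
        return touched;
    }
}
```

With that in place, both loops in Main could be replaced by a single call such as `XmlCleaner.CleanElements(doc, "Title", "Country");`.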
Chris Davis