
I'm very new to C#. I have written the code below, which loops through the paragraphs in a Word document and searches each one for matches against a list of regexes. It starts off very fast but becomes painfully slow after about 1000 paragraphs. Does anyone know how I can optimise it so that it doesn't slow down after many iterations?

Thanks in advance,

    static List<MatchDetails> GetRegexMatchesContent(Word.Document document, List<string> regexes)
    {
        var matchDetails = new List<MatchDetails>(2000);
        // For each paragraph in word document, run regex searches.
        for (int i = 0; i < document.Paragraphs.Count; i++)
        {
            Console.WriteLine($"Processed Paragraph {i} of {document.Paragraphs.Count}");
            var rng = document.Paragraphs[i + 1].Range;
            // Loop through each regex input.
            foreach (string regex in regexes)
            {
                // Match regexes on text.
                MatchCollection matches = Regex.Matches(rng.Text, regex, RegexOptions.IgnoreCase);
                foreach (Match match in matches)
                {
                    foreach (Capture capture in match.Captures)
                    {
                        if (capture.Value != "")
                        {
                            // Retrieve regex match information and save to MatchDetails List.
                            string matchText = capture.Value;
                            int matchPageNo = rng.Information[Word.WdInformation.wdActiveEndPageNumber];
                            int matchLineNo = rng.Information[Word.WdInformation.wdFirstCharacterLineNumber];
                            int matchCharNo = capture.Index;

                            MatchDetails matchRow = new MatchDetails { matchText = matchText, documentSection = "Body Text", pageNo = matchPageNo, lineNo = matchLineNo, charNo = matchCharNo };
                            matchDetails.Add(matchRow);
                        }
                    }
                }
            }
        }

        return matchDetails;
    }
adan11
  • Maybe try removing the `Console.WriteLine` calls. If the doc is very large, writing to the console will slow it down significantly – JamesS Jan 24 '22 at 10:00
  • Oh okay. Is there a better way I can track progress though? – adan11 Jan 24 '22 at 10:04
  • Check memory usage in task manager as code is run. The variable matchDetails is growing which uses memory. I would also remove the 2000 from the initialization. – jdweng Jan 24 '22 at 10:06
  • I believe that `Debug.WriteLine` is slightly faster – JamesS Jan 24 '22 at 10:06
  • Is there a better way I can hold the matchDetails data? Possibly in a list of lists? – adan11 Jan 24 '22 at 10:07
  • How long does this part take to execute: `var rng = document.Paragraphs[i + 1].Range;`? Nothing else in your code seems to depend on the iteration count (you don't check matchDetails before adding, ...) – J.Salas Jan 24 '22 at 10:07
  • It depends on where your bottleneck is... A long time ago I bookmarked [this answer](https://stackoverflow.com/questions/70122710/c-sharp-foreach-loop-comically-slower-than-for-loop-on-a-raspberrypi) that helps with checking performance. Also note the [Regex documentation](https://learn.microsoft.com/en-us/dotnet/standard/base-types/compilation-and-reuse-in-regular-expressions), which has a performance-specific section you could find useful – Cleptus Jan 24 '22 at 10:08
  • @MitchWheat I don't think so – adan11 Jan 24 '22 at 10:18
  • Check if there is an iterator or something to get the next paragraph, instead of indexing into the paragraphs. Perhaps it traverses all previous paragraphs at each loop iteration when indexing. – Bent Tranberg Jan 24 '22 at 10:18
  • Thank you @BentTranberg - that looks to have been the issue. :) I am now using `foreach (Word.Paragraph paragraph in document.Paragraphs)` and it is a lot faster. – adan11 Jan 24 '22 at 10:30
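
For anyone landing here later, the changes suggested in the comments could be combined roughly as follows. This is only a sketch, not tested against a real document: it assumes a reference to Microsoft.Office.Interop.Word, and the `MatchDetails` class is a guess at its shape based on the initializer in the question. It enumerates the paragraphs with `foreach` instead of indexing, builds each `Regex` once, reads `Count` and `rng.Text` once, and only prints progress every 100 paragraphs (the interval of 100 is arbitrary).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Word = Microsoft.Office.Interop.Word;

// Assumed shape of MatchDetails, inferred from the question's object initializer.
class MatchDetails
{
    public string matchText;
    public string documentSection;
    public int pageNo;
    public int lineNo;
    public int charNo;
}

static class ParagraphSearch
{
    public static List<MatchDetails> GetRegexMatchesContent(Word.Document document, List<string> regexes)
    {
        var matchDetails = new List<MatchDetails>();

        // Construct each Regex once rather than passing the pattern string
        // to the static Regex.Matches for every paragraph.
        List<Regex> compiled = regexes
            .Select(pattern => new Regex(pattern, RegexOptions.IgnoreCase))
            .ToList();

        // Every property access on the document is a COM call, so read Count once.
        int total = document.Paragraphs.Count;
        int processed = 0;

        // Enumerate the collection instead of indexing into it with Paragraphs[i + 1].
        foreach (Word.Paragraph paragraph in document.Paragraphs)
        {
            processed++;
            if (processed % 100 == 0 || processed == total)
            {
                Console.WriteLine($"Processed paragraph {processed} of {total}");
            }

            Word.Range rng = paragraph.Range;
            string text = rng.Text; // fetch the text once per paragraph
            if (string.IsNullOrEmpty(text))
            {
                continue;
            }

            foreach (Regex regex in compiled)
            {
                foreach (Match match in regex.Matches(text))
                {
                    // A Match's own Captures collection holds just the match itself,
                    // so match.Value and match.Index stand in for the capture loop.
                    if (match.Length == 0)
                    {
                        continue;
                    }

                    matchDetails.Add(new MatchDetails
                    {
                        matchText = match.Value,
                        documentSection = "Body Text",
                        pageNo = (int)rng.Information[Word.WdInformation.wdActiveEndPageNumber],
                        lineNo = (int)rng.Information[Word.WdInformation.wdFirstCharacterLineNumber],
                        charNo = match.Index
                    });
                }
            }
        }

        return matchDetails;
    }
}
```

The `foreach` change is the one reported above to make the difference. The other tweaks are smaller and worth measuring (for example with a `Stopwatch`) before relying on them.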
