I'm looking to display a list of regexes that will match a text string.
I'm using the date as an example. Blank spaces are standing in for other text.
FindMatchRegex goes through the list of regexes. Because I don't know where the text will match within the regex, I try every suffix of the regex. Starting with the whole string, I shorten the regex by chopping one character off the front each time. For each shortened pattern I first check that it is still a valid regex, then use PCRE to test for a partial or full match. If either is found, the original regex is added to the list of possible matching regexes.
On .NET Fiddle, this executes in about 1 second for a 200-character longString and 200 regexes. On my desktop (16 GB RAM, i5-3570K at 3.4 GHz), it takes about 6 seconds.
I'm looking for a response time of around 0.5 seconds. How can I get a 10X or 100X improvement in speed?
What command or technique am I missing?
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using PCRE;
    using System.Diagnostics;

    public class Program
    {
        public static void Main()
        {
            List<string> regexList = new();
            // 200 blank spaces stand in for the other text that precedes the date.
            string longString = new string(' ', 200);
            regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 1066");
            regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 1999");
            regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 2000");
            regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 2020");
            for (int i = 0; i < 200; i++)
                regexList.Add(longString + @"(\d|\d\d) April (\d\d\d\d)");

            // Only one checkString may be active at a time.
            string checkString = "1 December 1234 10";
            // string checkString = "1 December 4567 1";
            // string checkString = "1 December 1234 20";
            // string checkString = "1 December 1234";

            Stopwatch stopwatch = new();
            stopwatch.Start();
            List<string> result = FindMatchRegex(checkString, regexList);
            stopwatch.Stop();

            foreach (var item in result)
            {
                Console.WriteLine(checkString + " found to match " + item);
            }
            Console.WriteLine("Time elapsed: " + stopwatch.Elapsed);
        }

        private static List<string> FindMatchRegex(string filter, List<string> regexList)
        {
            List<string> matchingRegexes = new();
            for (int i = 0; i < regexList.Count; i++)
            {
                string currentRegex = regexList[i];
                bool anyMatches = false;
                int j = 0;
                // Try every suffix of the regex until one matches the filter.
                while (j < currentRegex.Length && !anyMatches)
                {
                    string currentRegexSubstring = currentRegex.Substring(j);
                    if (IsValidRegex(currentRegexSubstring))
                    {
                        var regex = new PcreRegex("^" + currentRegexSubstring);
                        var match = regex.Match(filter, PcreMatchOptions.PartialSoft);
                        anyMatches = match.IsPartialMatch || match.Success;
                    }
                    j++;
                }
                if (anyMatches)
                {
                    matchingRegexes.Add(currentRegex);
                }
            }
            return matchingRegexes;
        }

        // Returns false when the suffix was cut in the middle of a group or escape.
        private static bool IsValidRegex(string pattern)
        {
            if (string.IsNullOrWhiteSpace(pattern))
                return false;
            try
            {
                Regex.Match("", pattern);
            }
            catch (ArgumentException)
            {
                return false;
            }
            return true;
        }
    }
Edit
Purpose of program
I'm writing a translation program that uses in-house translations. Unique sentences match correctly and easily, but it gets tiresome to add a new translation for minor changes in dates or product item descriptions. So the dictionary includes regexes that match the English text to be translated into another language. This is perfect for dates and product items that don't change in the translation.
When a user wants to update a translation, rather than typing the whole English text, they can type in just a part of it to isolate the translation to update. So, I want to filter the list of terms from the dictionary to give the user a drop-down list of matching terms.
As an example, if I type in "31 December 2020", I want a list of all English terms that match 31 December 2020, but if the dictionary is using a regex "... (\d|\d\d) December (\d\d\d\d) ..." it won't match on a text basis. I want to scan the dictionary so all English terms with the regex "(\d|\d\d) December (\d\d\d\d)" will also match.
Have I been coming at this problem the wrong way?
Edit
Example strings to translate
Part ABC has been replaced by part XYZ on 21 July 2010 because of defect notice section 18c
Part DEF has been replaced by part RST on 15 July 2009 because of defect notice section 17b
Part DEF has been replaced by part RST on 15 July 2008 because of defect notice section 15a
The regex used to translate these strings. I have about 200 of these at the moment, and the number is expected to increase.
Part ([A-Z][A-Z][A-Z]) has been replaced by part ([A-Z][A-Z][A-Z]) on (\d\d) July (\d\d\d\d) because of defect notice section (\d\d[a-z])
The translation
Language part $1 Language has been replaced by part $2 on $3 Language July $4 Language because of defect note section $5
Matching the strings and translating them works fine. If during proof-reading, we get notified that "Part ABC has been replaced by part XYZ on 21 July 2010 because of defect notice section 18c" is wrong then the user can type in "by part XYZ" and the program can bring up "Part ([A-Z][A-Z][A-Z]) has been replaced by part ([A-Z][A-Z][A-Z]) on (\d\d) July (\d\d\d\d) because of defect notice section (\d\d[a-z])" as a possible matching English term and then go and edit the translation.
There are times when the text comes to us slightly wrong, or with spelling mistakes or extra punctuation. So we already have the translation, but the text given to us is wrong. So a partial match of the regex will bring up the existing English term and we can query if the English term needs to change or it was a mistake given to us to translate. Apologies for not making all this clearer, sooner.
Edit
Thank you to JeffC for writing some code. I've updated his code to convey more correctly what I want the code to do.
I'm looking to return partial matches, e.g. "31 July" needs to match both "Hi there (\d{1,2}) July (\d{4}) this is a test" and "This is another (\d{1,2}) July (\d{4}) test". The user does not have to type in the whole string in order to find a match in the string containing regexes.
The reason my original code checks for a valid regex is that the strings can contain more than one regex expression, and chopping characters off the front can cut through one of them. Trying to match with the resulting pattern throws an error, so I thought it better to check for a valid regex than to crash the program.
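As a sketch of how the validity check could be moved out of the search loop: the set of valid suffixes depends only on the regex, not on what the user types, so it could be computed once when the dictionary is loaded and reused for every search. The names `SuffixCache`, `ValidSuffixes` and `Build` are hypothetical, and this assumes that a suffix accepted by the .NET parser is also acceptable to PCRE, which is the same assumption the original `IsValidRegex` makes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuffixCache
{
    // Hypothetical helper: every suffix of `pattern` that the .NET regex parser accepts.
    public static List<string> ValidSuffixes(string pattern)
    {
        var suffixes = new List<string>();
        for (int j = 0; j < pattern.Length; j++)
        {
            string suffix = pattern.Substring(j);
            if (string.IsNullOrWhiteSpace(suffix))
                continue;
            try
            {
                _ = new Regex(suffix); // parse only; throws if the suffix is malformed
                suffixes.Add(suffix);
            }
            catch (ArgumentException)
            {
                // The cut went through a group or an escape: skip this suffix.
            }
        }
        return suffixes;
    }

    // Pays the exception cost once per distinct pattern instead of once per search.
    public static Dictionary<string, List<string>> Build(IEnumerable<string> patterns)
    {
        var cache = new Dictionary<string, List<string>>();
        foreach (string pattern in patterns)
        {
            if (!cache.ContainsKey(pattern))
                cache[pattern] = ValidSuffixes(pattern);
        }
        return cache;
    }
}
```

The search would then loop over `cache[currentRegex]` instead of re-validating each substring. Duplicate patterns, such as the repeated April entries in the test data, are validated only once.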
    static void Main(string[] args)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        // build your list of regex string, ideally reading them in from a file or getting them from a db
        List<string> regexList = new List<string>();
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1066 because of defect notice section (\d{1,2}[a-z])");
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1999 because of defect notice section (\d{1,2}[a-z])");
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2000 because of defect notice section (\d{1,2}[a-z])");
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2020 because of defect notice section (\d{1,2}[a-z])");
        for (int i = 0; i < 10; i++)
        {
            regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) April (\d{4}) 2020 because of defect notice section (\d{1,2}[a-z])");
        }

        // if you aren't going to maintain a clean list, clean it now before we start testing
        List<string> cleanRegexList = CleanRegexList(regexList);

        string checkString = "1 July 2000 1"; // Expect 2 results
        //string checkString = "1 July 2000 19"; // Expect 1 result
        //string checkString = "1 July 2000 20"; // Expect 2 results
        //string checkString = "1 July 2000 202"; // Expect 1 result
        //string checkString = "1 July 2000"; // Expect 4 results

        List<string> results = FindMatchRegex(checkString, cleanRegexList);
        stopwatch.Stop();

        foreach (string result in results)
        {
            Console.WriteLine(checkString + " found to match " + result);
        }
        Console.WriteLine("Time elapsed: " + stopwatch.Elapsed);
    }
If the user types in "1 July 2000 1" I want the program to return two results; the strings that contain "... (\d{1,2}) July (\d{4}) 1066 ..." and "... (\d{1,2}) July (\d{4}) 1999 ..."
If the user types in "1 July 2000 19" I want the program to return one result; the string that contains "... (\d{1,2}) July (\d{4}) 1999 ..."
If the user types in "1 July 2000 20" I want the program to return two results; the strings that contain "... (\d{1,2}) July (\d{4}) 2000 ..." and "... (\d{1,2}) July (\d{4}) 2020 ..."
If the user types in "1 July 2000 202" I want the program to return one result; the string that contains "... (\d{1,2}) July (\d{4}) 2020 ..."
If the user types in "1 July 2000" I want the program to return four results.
If the user types in "1 April 2000" I want the program to return 200 results (or 10 at the moment for testing purposes). It seems that throwing an exception takes a lot of time: the blank-spaces version ran a lot quicker than these more realistic strings, which have multiple regex expressions in each string.
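If the exceptions are the main cost, one possible way around them is to work out the safe cut positions without calling the regex parser at all. The sketch below is only that, under a loud assumption: every special construct in the pattern sits inside parentheses, as in the examples above, so a cut is safe whenever it is outside all groups and not in the middle of a backslash escape. It does not handle character classes or quantifiers outside a group. `CutPoints` and `SafeStarts` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CutPoints
{
    // Returns the indexes at which a suffix of `pattern` can start without
    // splitting a parenthesised group or a backslash escape.
    // Assumption: all regex syntax in the pattern is inside parentheses.
    public static List<int> SafeStarts(string pattern)
    {
        var starts = new List<int>();
        int depth = 0;        // current parenthesis nesting level
        bool escaped = false; // true when the previous character was an unescaped backslash

        for (int j = 0; j < pattern.Length; j++)
        {
            char c = pattern[j];
            if (escaped)
            {
                // Second half of an escape such as \d or \(: never a cut point.
                escaped = false;
                continue;
            }
            if (depth == 0)
                starts.Add(j);

            if (c == '\\') escaped = true;
            else if (c == '(') depth++;
            else if (c == ')') depth--;
        }
        return starts;
    }
}
```

For the pattern `(\d) July` this yields the positions 0, 4, 5, 6, 7 and 8, which are the same suffixes that survive the try/catch check. The caller would still need to skip suffixes that are only whitespace, as `IsValidRegex` does.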
The results are put into a dropdown list that the user can select. They can type in "1 July 2000 19" to get a unique match or if they type in "1 July 2000 1", it narrows it down to just two choices. Note the extra " 19" is a stand in for anything else, not that I would put " 1066" or "1999" after the date. It just makes it easy to see and understand (for me). If it gives the results I expect then it should work on anything else.
Updating the translations is such a dreary job that anything to speed it up and make it more convenient would be welcome.
I hope that is clearer. Thank you for reading and trying to understand.