-1

I'm looking to display a list of regexes that will match a text string.

I'm using the date as an example. Blank spaces are standing in for other text.

FindMatchRegex goes through the list of regexes. Because I don't know where the text will match within the regex, I match every substring of the regex. So starting with the whole string, I gradually reduce the regex by chopping one character off the front I check to see if it is a valid regex, then I check using PCRE regex to check for a partial or full match. If it is a partial or full match, add it to the list of possible matching regexes.

On .NET Fiddle, this is executing in about 1 second for 200 length longString and 200 regexes. On my Desktop, 16GB, i5-3570K 3.4GHz, it takes about 6 seconds.

I'm looking for a response time of around 0.5 seconds. How can I get a 10X or 100X improvement in speed?

What command or technique am I missing?

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using PCRE;
using System.Diagnostics;

public class Program
{
    public static void Main()
    {
        List<string> regexList = new();
        string longString = new string (' ', 200);
        regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 1066");
        regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 1999");
        regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 2000");
        regexList.Add(longString + @"(\d|\d\d) December (\d\d\d\d) 2020");
        for (int i = 0; i < 200; i++)
            regexList.Add(longString + @"(\d|\d\d) April (\d\d\d\d)");
                string checkString = "1 December 1234 10"; 
        //      string checkString = "1 December 4567 1";
        //      string checkString = "1 December 1234 20"; 
        string checkString = "1 December 1234";
        Stopwatch stopwatch = new();
        stopwatch.Start();
        List<string> result = FindMatchRegex(checkString, regexList);
        stopwatch.Stop();
        foreach (var item in result)
        {
            Console.WriteLine(checkString + " found to match " + item);
        }

        Console.WriteLine("Time elapsed: " + stopwatch.Elapsed);
    }

    private static List<string> FindMatchRegex(string filter, List<string> regexList)
    {
        List<string> matchingRegexes = new();
        for (int i = 0; i < regexList.Count; i++)
        {
            string currentRegex = regexList[i];
            bool anyMatches = false;
            int j = 0;
            while (j < currentRegex.Length && anyMatches == false)
            {
                string currentRegexSubstring = currentRegex.Substring(j);
                if (IsValidRegex(currentRegexSubstring))
                {
                    var regex = new PcreRegex("^" + currentRegexSubstring);
                    var match = regex.Match(filter, PcreMatchOptions.PartialSoft);
                    anyMatches = anyMatches || match.IsPartialMatch || match.Success;
                }

                j++;
            }

            if (anyMatches == true)
            {
                matchingRegexes.Add(currentRegex);
            }
        }

        return matchingRegexes;
    }

    private static bool IsValidRegex(string pattern)
    {
        if (string.IsNullOrWhiteSpace(pattern))
            return false;
        try
        {
            Regex.Match("", pattern);
        }
        catch (ArgumentException)
        {
            return false;
        }

        return true;
    }
}

Edit

Purpose of program

I'm writing a translation program that uses in-house translations. Unique sentences match correctly and easily, but it gets tiresome to add a new translation for minor changes in dates or product item description. So the dictionary includes regexes to match the English to translate into a language. Perfect for dates and product items that don't change in the translation.

When a user wants to update a translation, rather than type of the whole English, they can just type in a part of the English to isolate the translation to update. So, I want to filter the list of terms from the dictionary to give the user a drop down list of matching terms.

As an example, if I type in "31 December 2020", I want a list of all English terms that match 31 December 2020, but if the dictionary is using a regex "... (\d|\d\d) December (\d\d\d\d) ..." it won't match on a text basis. I want to scan the dictionary so all English terms with the regex "(\d|\d\d) December (\d\d\d\d)" will also match.

Have I been coming at this problem the wrong way?

Edit

Example strings to translate

Part ABC has been replaced by part XYZ on 21 July 2010 because of defect notice section 18c

Part DEF has been replaced by part RST on 15 July 2009 because of defect notice section 17b

Part DEF has been replaced by part RST on 15 July 2008 because of defect notice section 15a

Regex to translate the string, I have about 200 of these at the moment, expected to increase.

Part ([A-Z][A-Z][A-Z]) has been replaced by part ([A-Z][A-Z][A-Z]) on (\d\d) July (\d\d\d\d) because of defect notice section (\d\d[a-z])

The translation

Language part $1 Language has been replaced by part $2 on $3 Language July $4 Language because of defect note section $5

Matching the strings and translating them works fine. If during proof-reading, we get notified that "Part ABC has been replaced by part XYZ on 21 July 2010 because of defect notice section 18c" is wrong then the user can type in "by part XYZ" and the program can bring up "Part ([A-Z][A-Z][A-Z]) has been replaced by part ([A-Z][A-Z][A-Z]) on (\d\d) July (\d\d\d\d) because of defect notice section (\d\d[a-z])" as a possible matching English term and then go and edit the translation.

There are times when the text comes to us slightly wrong, or with spelling mistakes or extra punctuation. So we already have the translation, but the text given to us is wrong. So a partial match of the regex will bring up the existing English term and we can query if the English term needs to change or it was a mistake given to us to translate. Apologies for not making all this clearer, sooner.

Edit

Thank you to JeffC for writing some code. I've updated his code to convey more correctly what I want the code to do.

I'm looking return partial matches. e.g "31 July" needs to match to "Hi there (\d{1,2}) July (\d{4}) this is a test" and "This is another (\d{1,2}) July (\d{4}) test". The user does not have to type in the whole string in order to find a match in the string containing regexes.

The reason my original code checks for a valid regex is that while chopping up the potential matching string with the regex, there can be more than one regex and when you try to match it throw an error. So I thought to check for valid regex rather than crash the program.

static void Main(string[] args)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    // build your list of regex string, ideally reading them in from a file or getting them from a db
    List<string> regexList = new List<string>();
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1066 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1999 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2000 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2020 because of defect notice section (\d{1,2}[a-z])");
    for (int i = 0; i < 10; i++)
    {
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) April (\d{4}) 2020 because of defect notice section (\d{1,2}[a-z])");
    }

    // if you aren't going to maintain a clean list, clean it now before we start testing
    List<string> cleanRegexList = CleanRegexList(regexList);

    string checkString = "1 July 2000 1"; // Expect 2 results
    //string checkString = "1 July 2000 19"; // Expect 1 result
    //string checkString = "1 July 2000 20"; // Expect 2 results
    //string checkString = "1 July 2000 202"; // Expect 1 result
    //string checkString = "1 July 2000"; // Expect 4 results

    List<string> results = FindMatchRegex(checkString, cleanRegexList);
    stopwatch.Stop();

    foreach (string result in results)
    {
        Console.WriteLine(checkString + " found to match " + result);
    }

    Console.WriteLine("Time elapsed: " + stopwatch.Elapsed);
}

If the user types in "1 July 2000 1" I want the program to return two results; the strings that contain "... (\d{1,2}) July (\d{4}) 1066 ..." and "... (\d{1,2}) July (\d{4}) 1999 ..."

If the user types in "1 July 2000 19" I want the program to return one result; the string that contains "... (\d{1,2}) July (\d{4}) 1999 ..."

If the user types in "1 July 2000 20" I want the program to return two results; the strings that contain "... (\d{1,2}) July (\d{4}) 2000 ..." and "... (\d{1,2}) July (\d{4}) 2020 ..."

If the user types in "1 July 2000 202" I want the program to return one result; the string that contains "... (\d{1,2}) July (\d{4}) 2020 ..."

If the user types in "1 July 2000" I want the program to return four results.

If the user types in "1 April 2000" I want the program to return 200 results (or 10 at the moment for testing purposes). It seems throwing an exception takes a lot of time as the blank spaces version ran a lot quicker than these more real life strings that have multiple expressions in the string.

The results are put into a dropdown list that the user can select. They can type in "1 July 2000 19" to get a unique match or if they type in "1 July 2000 1", it narrows it down to just two choices. Note the extra " 19" is a stand in for anything else, not that I would put " 1066" or "1999" after the date. It just makes it easy to see and understand (for me). If it gives the results I expect then it should work on anything else.

Updating the translations is such a dreary job that anything to speed it up and make it more convenient would be welcome.

I hope that is clearer. Thank you for reading and trying to understand.

Alan
  • 306
  • 2
  • 10
  • Can you elaborate on the why a bit more? There could be insights in it. In that spirit: Is the prefix string (currently spaces) always the same, is it varying and/or could it contain other regexes? Is the regex you're interested in always at the end of the "long" string? – sunside Apr 18 '21 at 15:50
  • 1
    This seem like a random task with no obvious purpose. So like sunside says, it would be nice to know a bit more about the problem you are trying to solve here - i.e. **why** are you doing this? – Xerillio Apr 18 '21 at 16:17
  • I still don't really understand what you are trying to do. Some concrete examples might help. Give a few examples of input strings and what you are looking for as output. – JeffC Apr 19 '21 at 15:46
  • @sunside I've added an example with strings. There is no particular prefix. It's a string containing one or more regexes. Dates generally come at the end but can also be at the beginning. Thanks. – Alan Apr 21 '21 at 19:08
  • @Xerillio I'm doing this to stop user frustration. It's like validating user input against a regex except I have over 100 regexes and the input doesn't have to start at the beginning. If the user looks at the incorrect translation, they search it up by typing in 31 December 2020. It he doesn't see (\d|\d\d) December (\d\d\d\d) as an existing match they will just enter a new redundant translation! If then they try to enter the translation, I can error check, see if it matches an English term then kick out an error; the user says "What a stupid program, can't recognise a regex!!" – Alan Apr 21 '21 at 19:15
  • @JeffC I've added some string examples. The output is to get a list of all regexes that can match the input text. I can put the regexes into a dropdown the user can easily select. I hope the above example I've made is clear. – Alan Apr 21 '21 at 19:18
  • Given your update, this really doesn't have anything to do with regex... it has to do with strings and string matching. Those strings just happen to be regexs. You have a collection of strings (regexs) and you want the user to type a string (a substring of a string in the collection) and then you want to find any string in your collection that matches that substring. Is that correct? – JeffC Apr 21 '21 at 22:14
  • @JeffC Thank you for taking the time to understand the problem and express it better than I can. That is right. The list of strings I want to match against is also regexes. Sorry I haven't been able to make that clearer sooner. – Alan Apr 22 '21 at 09:53

3 Answers3

0

To be honest, I still don't quite understand the purpose of it all. I think perhaps an explanation seen more from a user's perspective would help. Right now the explanation focuses heavily on the technical functionality you want to make work, while I think a more interesting question is what you need from a user's perspective.

That being said, what I do think I've been able to understand is that you want to produce a list of strings (in this case a list of regular expressions) that match some given input string. Since you don't need a perfect match it makes me think of fuzzy string matching and Levenshtein's distance formula.

To make life a little easier on yourself I'd suggest looking for a library that can do this for you. A quick search led me to JakeBayer's FuzzySharp which you can also find on NuGet. I haven't used this before so you should simply see it as a random example.

In that library it seems to allow you to for example pick top X strings that match some input. In your case that could look like:

Process.ExtractTop(
    "Part ABC has been replaced by part XYZ on 21 July 2010 because of defect notice section 18c",
    regexList,
    limit: 10);

... which will give you the 10 best matching regular expressions. You can also adjust how it calculates the score (how well the regex matches) by adding a scorer parameter to the call above, if the default isn't to your liking.

Matching on strings "fuzzily" allows room for misspelling etc. while still giving you a chance to find the right regex you're looking for.

Xerillio
  • 4,855
  • 1
  • 17
  • 28
  • Thank you for suggesting Fuzzy matching but in this case, it's not going to work. The example Fuzz.Ratio("mysmilarstring","mysimilarstring") 97, I would not have that match because one is not a subset of the other. I would want Fuzz.Ratio("mys","mysimilarstring") to match and Fuzz.Ratio("mysm","mysimilarstring") to Not match so I don't think it's going to work unless you see know of an option around that which I can't see at the moment. Thank you for the suggestion. – Alan Apr 22 '21 at 20:38
  • @Alan that's ok. I just understood from your question that you would also like to handle spelling mistakes. If you have spelling mistakes, then fuzzy matching makes it easier for you to find relevant regexes. In your example, your own current implementation will also make `mysm` and `mysimilarstring` match since both contain `mys` as I see it – Xerillio Apr 23 '21 at 16:46
  • My lack of clarity. It would be great to copy and paste the whole English in and check but a spelling or punctuation difference would break it. If you suspect the English should be in the dictionary but it's not there (but that translated before!?), one way around is to paste part of the sentence to bring up the matching whole sentence and try to discover why when you copy and paste the whole English, it didn't show up. Agree about "mys" but as soon as you try "mysm" against "mysimilarstring" it won't match. Thank you again for spending the time looking at my problem. – Alan Apr 23 '21 at 22:00
0

I would break this task into two parts for maximum speed and maintainability...

  1. Regexes should be validated when added to the list/catalog (outside of your current program)
  2. Run regexes against string to find matches

Step 1

I don't know how you store the list of regexes now but I would maintain them in some sort of flat file (text file) or db. Have a process where when a regex is added to the file, it gets validated, e.g. your IsValidRegex() method. This makes the maintenance task much easier and you know that when you load the list of regexes into your program, you don't have to check (again) to see that they are valid.

If you can't or don't want to do this, do your cleanup once at the start of the program and then if you test multiple strings, the cleaning only happens once.

Having said all this... in this case with just 200 regexes, it didn't make any speed difference. But, as this count grows, it will likely get slower and I think it's just better process to keep the two parts separate.

Side note: I'm assuming you got the IsValidRegex() method from here. The guy that provided that answer, Jeff Atwood, is one of the founders of SO.

Step 2

There are some bits I don't understand like the 2020 in the regex below.

(\d|\d\d) December (\d\d\d\d) 2020

Unless I'm missing something, the regex should not contain the 2020, that should be in the test string, e.g.

Regex: (\d|\d\d) December (\d\d\d\d)
Test string: 25 December 2020

Also, you are loading all of your regexes with 200 spaces at the start. Those 200 spaces are now part of the regex and must be matched. It seems to me that you would want to find those dates anywhere in the string, not just after 200 spaces.

That said, you seem to be new with regexes.

(\d|\d\d) December (\d\d\d\d) 1066

can be simplified to

(\d{1,2}) December (\d{4}) 1066

I would suggest that you look at regex101.com. It will help you test your regexes and also in the right panel, breaks your regex down to individual characters and explains what they mean as well as has a large reference section.

I was also confused about your FindMatchRegex() method. You seem to be walking one character at a time through the string until you find a match. Unless there's something I'm missing, that kinda defeats the purpose of regex. If you want to match the date format anywhere in the string, you make the regex as simple and short as possible so that it will match all instances regardless of position in the test string.

Hopefully that all makes sense. I've rewritten your code into the below. When I run it, it completes in ~10ms... even with the StopWatch at the start of the code AND cleaning all 200 regexes.

This outputs

1 December 2020 found to match (\d{1,2}) December (\d{4})
Time elapsed: 00:00:00.0083255

Main

public static void Main()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    // build your list of regex string, ideally reading them in from a file or getting them from a db
    List<string> regexList = new List<string>();
    regexList.Add(@"(\d{1,2}) December (\d{4})");
    for (int i = 0; i < 200; i++)
    {
        regexList.Add(@"(\d{1,2}) April (\d{4})");
    }

    // if you aren't going to maintain a clean list, clean it now before we start testing
    List<string> cleanRegexList = CleanRegexList(regexList);

    string checkString = "1 December 2020";
    // string checkString = "15 April 2020";

    List<string> results = FindMatchRegex(checkString, cleanRegexList);
    stopwatch.Stop();

    foreach (string result in results)
    {
        Console.WriteLine(checkString + " found to match " + result);
    }

    Console.WriteLine("Time elapsed: " + stopwatch.Elapsed);
}

Support methods

private static List<string> FindMatchRegex(string checkString, List<string> regexList)
{
    List<string> matchingRegexes = new List<string>();
    foreach (string regexString in regexList)
    {
        Regex regex = new Regex(regexString);
        if (regex.IsMatch(checkString))
        {
            matchingRegexes.Add(regexString);
        }
    }

    return matchingRegexes;
}

private static List<string> CleanRegexList(List<string> regexList)
{
    List<string> cleanRegexes = new List<string>();
    foreach (string regexString in regexList)
    {
        if (IsValidRegex(regexString))
        {
            cleanRegexes.Add(regexString);
        }
    }

    return cleanRegexes;
}

I did not modify IsValidRegex().

JeffC
  • 22,180
  • 5
  • 32
  • 55
  • Thank you very much for writing the code. Thank you for your tips on regex. Yes, I'm new to regex. Regex is the best thing in the world to stop having unique translation strings :-) Unfortunately, the regex can be a minor part of the whole text string and there can be multiple regexes in the string as various parts to the English string can carry over into the translation without alteration. I've updated your code to illustrate what I mean. Yes, I picked up the IsValidRegex to check the partial regex strings I was matching. Thanks. – Alan Apr 22 '21 at 19:34
0

I've been inspired by JeffC suggestion of splitting it up. After finding out that the majority of time is caused by the exception handling, that can be done at dictionary entry time. Its about 28 seconds for 14 items, about 2 second per item. So when you first enter the dictionary entry, there is a 2 second delay before all the valid regexes from that entry is created. I can store that somewhere (waves hand!). With a 4 core processor, I should be able to parallel process it to bring it down to under a second.

If there is a better way to generate the partial valid substrings of the regex then please let me know.

Anyway, it turns out matching the string against regexes is quite quick.

Setting up time elapsed: 00:00:28.6490097

Finding regex time elapsed: 00:00:00.0633832

1 July 2000 1 found to match Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1066 because of defect notice section (\d{1,2}[a-z])

1 July 2000 1 found to match Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1999 because of defect notice section (\d{1,2}[a-z])

static void Main(string[] args)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    // build your list of regex string, ideally reading them in from a file or getting them from a db
    List<string> regexList = new List<string>();
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1066 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 1999 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2000 because of defect notice section (\d{1,2}[a-z])");
    regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) July (\d{4}) 2020 because of defect notice section (\d{1,2}[a-z])");
    for (int i = 0; i < 10; i++)
    {
        // Needs to be unique for dictionary
        regexList.Add(@"Part ([A-Z]{3}) has been replaced by part ([A-Z]{3}) on (\d{1,2}) April (\d{4}) " + i.ToString() + @" because of defect notice section (\d{1,2}[a-z])");
    }

    // I'll store this somewhere
    Dictionary<string, List<string>> regexDictionary = FindAllRegexStrings(regexList);

    string checkString = "1 July 2000 1"; // Expect 2 results
    //string checkString = "1 July 2000 19"; // Expect 1 result
    //string checkString = "1 July 2000 20"; // Expect 2 results
    //string checkString = "1 July 2000 202"; // Expect 1 result
    //string checkString = "1 July 2000"; // Expect 4 results

    //List<string> results = FindMatchRegex(checkString, cleanRegexList);
    stopwatch.Stop();
    Console.WriteLine("Setting up time elapsed: " + stopwatch.Elapsed);

    stopwatch.Reset();
    stopwatch.Start();

    List<string> results = FindMatchRegex(checkString, regexDictionary);

    stopwatch.Stop();
    Console.WriteLine("Finding regex time elapsed: " + stopwatch.Elapsed);

    foreach (string result in results)
    {
        Console.WriteLine(checkString + " found to match " + result);
    }
}

private static List<string> FindMatchRegex(string checkString, Dictionary<string, List<string>> regexDictionary)
{
    List<string> matchingRegexes = new();

    foreach (var regexEntry in regexDictionary)
    {
        List<string> currentRegexes = regexEntry.Value;
        bool anyMatches = false;

        for (int j = 0; j < currentRegexes.Count && anyMatches == false; j++)
        {
            string currentRegexSubstring = currentRegexes[j];
            var regex = new PcreRegex("^" + currentRegexSubstring);
            var match = regex.Match(checkString, PcreMatchOptions.PartialSoft);
            anyMatches = anyMatches || match.IsPartialMatch || match.Success;
        }

        if (anyMatches == true)
        {
            matchingRegexes.Add(regexEntry.Key);
        }
    }
    return matchingRegexes;
}

private static Dictionary<string, List<string>> FindAllRegexStrings(List<string> regexList)
{
    Dictionary<string, List<string>> regexDictionary = new();

    for (int i = 0; i < regexList.Count; i++)
    {
        regexDictionary.Add(regexList[i], ValidRegexes(regexList[i]));
    }

    return regexDictionary;
}

private static List<string> ValidRegexes(string regexString)
{
    List<string> listRegexes = new();
    for (int i = 0; i < regexString.Length; i++)
    {
        string currentRegexSubstring = regexString.Substring(i);
        if (IsValidRegex(currentRegexSubstring))
        {
            listRegexes.Add(currentRegexSubstring);
        }
    }
    return listRegexes;
}

private static bool IsValidRegex(string pattern)
{
    if (string.IsNullOrWhiteSpace(pattern))
        return false;
    try
    {
        Regex.Match("", pattern);
    }
    catch (ArgumentException)
    {
        return false;
    }

    return true;
}

Seriously, if there is a better way to find all the valid regex substrings at data entry time then please let me know. I've kind of hidden the problem, pending a better solution, but at least the interface will be responsive (note to self, read up on parallel processing!) :-)

Please let me know if there is a better solution than my clunky method of using exception handling to find all the valid regex substrings to match against.

Thank you to everyone who has contributed to this thread.

Alan
  • 306
  • 2
  • 10