Inverse regex processing to produce regex phrase

Question

We take the normal regex processor and pass the input text and the regex phrase to capture the desired output text.

output = the_normal_regex(
         input = "12$abc@#EF345", 
         phrase = "\d+|[a-zA-Z]+") 
       = ["12", "abc", "EF", "345"]
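
For concreteness, the forward direction above can be sketched in Python, assuming `re.findall` plays the role of `the_normal_regex` (the variable names are illustrative, not part of the question):

```python
import re

# The input text and regex phrase from the example above.
text = "12$abc@#EF345"
pattern = r"\d+|[a-zA-Z]+"

# findall returns every non-overlapping match, scanning left to right.
matches = re.findall(pattern, text)
print(matches)  # ['12', 'abc', 'EF', '345']
```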

Can we invert this process, so that a tool receives both the input text and the output text and produces an adequate regex phrase, especially if the text size is limited to a practical minimum, e.g. a few dozen characters? Is any tool available for this?

phrase = the_inverse_tool(
         input = "12$abc@#EF345", 
         output=["12", "abc", "EF", "345"]) 
       = "\d+|[a-zA-Z]+"

Certainly! If, for example, the input text were `"cat"` and the output text were `"dog"` we could replace the match of `\bcat\b`, `\c.*`, `\.*a.*` and other regular expressions with `"dog"`. Which is correct? The problem is this. When a regular expression is used to match a string or part of a string, two things are needed: the string itself and the *rule* to be implemented by the regular expression (e.g., match the first word after the second comma). You are asking to produce the regular expression from the two strings without the rule. That's just silly. — Cary Swoveland, Jun 05 '20 at 01:41
My point can be illustrated by the example in the question. It is suggested the desired regular expression would be `"\d+|[a-zA-Z]+"`. Another would be `"12|abc|EF|345"`, but neither would be of any use for matching other strings. — Cary Swoveland, Jun 05 '20 at 01:54

score 2 · Accepted Answer · answered Jun 05 '20 at 02:13

What you're asking appears to be whether there is some algorithm or existing library that takes an input string (like "12$abc@#EF345") and a set of matches (like ["12", "abc", "EF", "345"]) and produces an "adequate" regex that would produce the matches, given the input string.

However, what does 'adequate' mean in this context? For your example, a simple answer would be: "12|abc|EF|345". However, it appears you expect something more like the generalised "\d+|[a-zA-Z]+"
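
To illustrate how little the "simple answer" involves, here is a minimal sketch in Python. The helper name `literal_regex` is hypothetical, not an existing library function. It just escapes each expected match and joins them with `|`; that reproduces the output for this example, though nothing guarantees it for inputs where a literal also occurs somewhere it should not match:

```python
import re

def literal_regex(matches):
    """Build an alternation of the escaped literal matches (hypothetical helper)."""
    # Drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(matches))
    # Try longer literals first so a shorter one cannot shadow a longer one
    # that starts with the same characters.
    unique.sort(key=len, reverse=True)
    return "|".join(re.escape(m) for m in unique)

text = "12$abc@#EF345"
expected = ["12", "abc", "EF", "345"]

pattern = literal_regex(expected)
reproduced = re.findall(pattern, text)
print(reproduced == expected)  # True for this example
```

Such a regex says nothing about the rule behind the matches, which is why it is useless on any other input.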

Note that your generalisation makes a number of assumptions, for example that words in French, Swedish or Chinese shouldn't be matched. And numbers containing , or . are also not included.
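
A quick check in Python shows those assumptions at work. The sample string is made up for illustration; the pattern is the one from the question:

```python
import re

pattern = r"\d+|[a-zA-Z]+"

# A word with an accented letter and a number with separators.
sample = "café 1,234.5"

# The accented letter cuts the word short, and the number splits at "," and ".".
pieces = re.findall(pattern, sample)
print(pieces)  # ['caf', '1', '234', '5']
```

Whether that behaviour is a bug or exactly what was wanted cannot be decided from the single input/output pair in the question.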

You cannot expect a generalised algorithm to make those kinds of distinctions, as those are essentially problems requiring general AI, understanding the problem domain at an abstract level and coming up with a suitable solution.

Another way of looking at it is: your question is the same as asking if there is some function or library that automates the work of a programmer (specific to the regex language). The answer is: no, not yet anyway, and by the time there is, there won't be people on Stack Overflow asking or answering these questions, because we'll all be out of a job.

However, some more optimistic viewpoints can be found here: Is it possible for a computer to "learn" a regular expression by user-provided examples?
