109

Is it possible for a computer to "learn" a regular expression by user-provided examples?

To clarify:

  • I do not want to learn regular expressions.
  • I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.

Is it possible? Are there algorithms, keywords, etc. which I can Google for?

EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.

Anderson Green
  • 30,230
  • 67
  • 195
  • 328
Daniel Rikowski
  • 71,375
  • 57
  • 251
  • 329

10 Answers10

55

Yes, it is possible, we can generate regexes from examples (text -> desired extractions). This is a working online tool which does the job: http://regex.inginf.units.it/

Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm. The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor). This tool is a demostrative application by the Machine Lerning Lab, Trieste Univeristy (Università degli studi di Trieste). Please look at the video tutorial here.

This is a research project so you can read about used algorithms here.

Behold! :-)

Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well. Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:

"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken"  -> "789-345B"

An (human) guy, looking at the examples, may say: "the item codes are things like \d++-345[AB]"

When the item code is more permissive but we have not provided other examples, we have not proofs to understand the problem well. When applying the human generated solution \d++-345[AB] to the following text, it fails:

"On the back of the item there is a code: 966-347Z"

You have to provide other examples, in order to better describe what is a match and what is not a desired match: --i.e:

"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"

The phone number is not a product id, this may be an important proof.

Fabiano Tarlao
  • 3,024
  • 33
  • 40
  • 7
    This should be top answer. It is possible, and this demonstrates it. The source is available here: https://github.com/MaLeLabTs/RegexGenerator – rjurney Nov 28 '15 at 18:04
  • Your example of the product codes illustrates why said human should look up the specification for product codes and write the regular expression based on the specification, rather than trying to infer the regex from a limited set of sample product codes (regardless of whether a person or a program is trying to infer the regex). – Jan Goyvaerts Feb 12 '16 at 02:57
  • 3
    This is the right way to do things. My example is only a way to, conceptually, explain the problem. Sometimes there is no specification, or the guy is simply not able to write the regular expression(lack of knowledge) by his own. – Fabiano Tarlao Feb 12 '16 at 10:58
  • To be more precise and unambiguous :-), "This is the right way to do things" -> "You are right, your is the best way to do things, you should always start from specifications when possible" – Fabiano Tarlao Feb 12 '16 at 11:59
  • 3
    The article "Inference of Regular Expressions for Text Extraction from Examples" contains a detailed explanation of the algorithm http://machinelearning.inginf.units.it/publications/international-journal-publications/inferenceofregularexpressionsfortextextractionfromexamples – mimmuz Apr 13 '16 at 15:18
49

The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.

Gumbo
  • 643,351
  • 109
  • 780
  • 844
Yuval F
  • 20,565
  • 5
  • 44
  • 69
37

No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.

Suppose you provide the examples 111111 and 999999, should the computer generate:

  1. A regex matching exactly those two examples: (111111|999999)
  2. A regex matching 6 identical digits (\d)\1{5}
  3. A regex matching 6 ones and nines [19]{6}
  4. A regex matching any 6 digits \d{6}
  5. Any of the above three, with word boundaries, e.g. \b\d{6}\b
  6. Any of the first three, not preceded or followed by a digit, e.g. (?<!\d)\d{6}(?!\d)

As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.

If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.

Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.

It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is how does it ask the user for the guiding information that it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.

Hosam Aly
  • 41,555
  • 36
  • 141
  • 182
Jan Goyvaerts
  • 21,379
  • 7
  • 60
  • 72
  • You are right, many learning algorithms I found after posting my question require positive and negative information. As far is I understand, an explicit "higher-level description" is not necessary, because the user is providing it by answering questions. – Daniel Rikowski Mar 07 '09 at 13:11
  • If a tool asks questions, then the combination of the questions and the answers given form the higher-level description. The quality of such tools largely depends on the questions it asks. – Jan Goyvaerts Mar 09 '09 at 13:00
  • That's stupid because if you provided another example, then you can weed out some of those possibilities. A further example weeds out more. – Cris Oct 23 '14 at 07:41
  • 2
    @Cris: The principle remains, regardless of how many samples you provide. It simply changes the possibilities. For example, adding 123456 changes #2 to (\d)\1{5}|123456 and #3 to [19]{6}|123456. Or it could change #3 to [1-69]{6}. It could even be that the desired pattern would match 6 identical digits or 6 digits where each digit is one greater than the preceding digit. Even if you provide 10,000 samples of 6-digit numbers, the program cannot distinguish between #1, #4, #5, or #6 without extra instructions from the user. – Jan Goyvaerts Oct 24 '14 at 03:43
  • I feel like this problem is easily solved as follows: The program tries to be as general as possible (within reason) and then gives you examples of other things it thinks would match. By simply telling it 'yes' and 'no' to proposed matches, you could help it understand the bounds of what you're actually trying to capture. I would love to see a tool in a text editor which used this concept and proposed matches from the currently open file. – Jon McClung Feb 07 '19 at 20:36
  • It's true you will never get a 100% correct regex from a subset of the words. But for many applications, we can cope with an approximation. All we need is an extra parameter to control how broad or tight we want the regex to be. Somewhere between matching only the given words and matching `.*`. I guess with some work it can be framed as a likelihood probability if we assume some distribution of the characters in their classes. – Celelibi Jan 03 '22 at 19:00
8

Yes, it's certainly "possible"; Here's the pseudo-code:

string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
   if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
     return <IntersectionError>

   string regex = "";
   foreach(string example in <listOfPosExamples>)
   {
      if(regex != "")
      {
         regex += "|";
      }
      regex += DoRegexEscaping(example);
   }
   regex = "^(" + regex + ")$";

   // Ignore <listOfNegExamples>; they're excluded by definition

   return regex;
}

The problem is that there are an infinite number of regexs that will match a list of examples. This code provides the simplest/stupidest regex in the set, basically matching anything in the list of positive examples (and nothing else, including any of the negative examples).

I suppose the real challenge would be to find the shortest regex that matches all of the examples, but even then, the user would have to provide very good inputs to make sure the resulting expression was "the right one".

Daniel LeCheminant
  • 50,583
  • 16
  • 120
  • 115
  • 3
    It starts getting interesting when the user enters positive *and negative* samples. The regex would have to match the positive samples and not match on the negative ones. – user55400 Mar 06 '09 at 10:07
  • @blixtor - Actually, it's quite easy. Just don't put any of the negative example in the constructed regex, and they'll be rejected. Remember, the one that the code builds matches only positive example; negative examples (and anything else) are excluded by definition! – Daniel LeCheminant Mar 06 '09 at 16:12
  • Daniel is right. Without a higher-level description, a list of alternatives is all that can be inferred consistently and accurately from a list of examples. – Jan Goyvaerts Mar 07 '09 at 01:30
7

I believe the term is "induction". You want to induce a regular grammar.

I don't think it is possible with a finite set of examples (positive or negative). But, if I recall correctly, it can be done if there is an Oracle which can be consulted. (Basically you'd have to let the program ask the user yes/no questions until it was content.)

Jay Kominek
  • 8,674
  • 1
  • 34
  • 51
5

You might want to play with this site a bit, it's quite cool and sounds like it does something similar to what you're talking about: http://txt2re.com

Chad Birch
  • 73,098
  • 23
  • 151
  • 149
4

There's a language dedicated to problems like this, based on prolog. It's called progol.

As others have mentioned, the basic idea is inductive learning, often called ILP (inductive logic programming) in AI circles.

Second link is the wiki article on ILP, which contains a lot of useful source material if you're interested in learning more about the topic.

patros
  • 7,719
  • 3
  • 28
  • 37
  • I found [a paper](https://link.springer.com/chapter/10.1007/3-540-40030-3_8) that describes a method for grammar induction using inductive logic programming, but I've never seen an implementation of this method in Progol. – Anderson Green Mar 12 '23 at 18:14
3

@Yuval is correct. You're looking at computational learning theory, or "inductive inference. "

The question is more complicated than you think, as the definition of "learn" is non-trivial. One common definition is that the learner can spit out answers whenever it wants, but eventually, it must either stop spitting out answers, or always spit out the same answer. This assumes an infinite number of inputs, and gives absolutely no garauntee on when the program will reach its decision. Also, you can't tell when it HAS reached its decision because it might still output something different later.

By this definition, I'm pretty sure that regular languages are learnable. By other definitions, not so much...

Brian Postow
  • 11,709
  • 17
  • 81
  • 125
2

I've done some research on Google and CiteSeer and found these techniques/papers:

Also Dana Angluin's "Learning regular sets from queries and counterexamples" seems promising, but I wasn't able to find a PS or PDF version, only cites and seminar papers.

It seems that this is a tricky problem even on the theoretical level.

Daniel Rikowski
  • 71,375
  • 57
  • 251
  • 329
0

If its possible for a person to learn a regular expression, then it is fundamentally possible for a program. However, that program will need to be correctly programmed to be able to learn. Luckily this is a fairly finite space of logic, so it wouldn't be as complex as teaching a program to be able to see objects or something like that.

cjk
  • 45,739
  • 9
  • 81
  • 112
  • 1
    Not true, you should look up problems that are undecidable on Turing machines. – Stephen Curial Mar 05 '09 at 19:48
  • To be fair, I said if a person can learn a REGEX, then a machine can. I wasn't meaning it generally. – cjk Mar 05 '09 at 19:55
  • @scurial I don't think there are problems which are proved to be solvable by people but undecidable on turing machines, are there? – Sunny88 Nov 17 '11 at 14:58