14

I have a strange problem, at least one I've never come across. I have a precondition where customers have simple regular expressions associated with labels. The labels are all they care about. What I would like to do is create a list of all possible numbers that would match each of these regular expressions. I would have logic that would warn me when the list is beyond a certain threshold.

Here is an example of the regular expression: 34.25.14.(227|228|229|230|243|244|245|246)

Lets say that these ip(s) are associated with ACME. Behind the scenes when the user selects ACME (in our UI), I'm filling out a filter object that contains all of those possible numbers and submitting them as an OR query to a highly specialized Vertica database.

I just can't determine an elegant way of creating a list of numbers from said regular expressions.

The others aspect of this, is that the java code within another portion of the product is using those regular expressions to show ACME by using a java Pattern.compile(), which means the customer 'could' create a complex regular expression. I've only seen them, so far, use something simple as shown above.

Is there a method that will generate a list based on regular expression?

Thanks for your time.

AlexR
  • 114,158
  • 16
  • 130
  • 208
D-Klotz
  • 1,973
  • 1
  • 15
  • 37
  • This is a good question, never thought about this before. – aglassman Oct 16 '12 at 21:23
  • For certain regular expressions yes. But what if they choose something as trivial as .+ – Eduardo Oct 16 '12 at 21:26
  • is your regex always going to be like (digits).(digits).(digits).(digits|digits|digits) pattern? – Victor Mukherjee Oct 16 '12 at 21:31
  • @VictorMukherjee - For this product, since they are matching IP addresses, yes, the regex pattern should be simple as the example above. They are attempting to match a range or at worst (perhaps) a subnet, to a label. – D-Klotz Oct 17 '12 at 13:31
  • Correct me if I'm wrong, but isn't 34.25.14.X the wrong regex for IP numbers? Shouldn't it be 34\.25\.14\.X as the dot is a special character? – Kirill Rakhman Oct 23 '12 at 16:59

5 Answers5

7

Related:

A library that generates data matching a regular expression (with limitations): http://code.google.com/p/xeger/

Several solutions, such as conversing a regex into a grammar: Using Regex to generate Strings rather than match them


EDIT: Actually, you can make it work!!! The only thing to address is to impose some domain-specific constraints to preclude combinatorial explosion like a+.

If you add to the Xeger class something like this:

public void enumerate() {
    System.out.println("enumerate: \"" + regex + "\"");
    int level = 0;
    String accumulated = "";
    enumerate(level, accumulated, automaton.getInitialState());
}

private void enumerate(int level, String accumulated, State state) {
    List<Transition> transitions = state.getSortedTransitions(true);
    if (state.isAccept()) {
        System.out.println(accumulated);
        return;
    }
    if (transitions.size() == 0) {
        assert state.isAccept();
        return;
    }
    int nroptions = state.isAccept() ? transitions.size() : transitions.size() - 1;
    for (int option = 0; option <= nroptions; option++) {
        // Moving on to next transition
        Transition transition = transitions.get(option - (state.isAccept() ? 1 : 0));
        for (char choice = transition.getMin(); choice <= transition.getMax(); choice++) {
            enumerate(level + 1, accumulated + choice, transition.getDest());
        }
    }
}

... and something like this to XegerTest:

@Test
public void enumerateAllVariants() {
    //String regex = "[ab]{4,6}c";
    String regex = "34\\.25\\.14\\.(227|228|229|230|243|244|245|246)";
    Xeger generator = new Xeger(regex);
    generator.enumerate();
}

... you will get this:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running nl.flotsam.xeger.XegerTest
enumerate: "34\.25\.14\.(227|228|229|230|243|244|245|246)"
34.25.14.227
34.25.14.228
34.25.14.229
34.25.14.243
34.25.14.244
34.25.14.245
34.25.14.246
34.25.14.230
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.114 sec

... and, guess what. For "[ab]{4,6}c" it correctly produces 112 variants.

It's really a quick and dirty experiment, but it seems to work ;).

Community
  • 1
  • 1
full.stack.ex
  • 1,747
  • 2
  • 11
  • 13
  • 3
    Don't just post links - at least include a summary of what they contain, and show how they are relevant. – jrtc27 Oct 16 '12 at 21:30
  • This link references a package that can generate a number that matches the regex, but not all numbers (I think). – Chris Gerken Oct 16 '12 at 21:31
  • Right. And the solution can hardly be simpler than that provided by the library. At least it can be a starting point. – full.stack.ex Oct 16 '12 at 21:33
  • 1
    @ChrisGerken: It is impossible to generate all matches of a regular expression. How would you generate them for something as trivial as `a+`? – Eduardo Oct 16 '12 at 21:33
  • 2
    Guys, this s a good answer (+1). I call you to up-vote. The link to other discussion is very helpful (IMHO). The answer marked as right there explains how to extract portions of internal representation of regex. – AlexR Oct 16 '12 at 21:34
  • 1
    @Eduardo Thankfully having an expression such as that would defeat the purpose of the product it is used within. They wish to match a small subset of IP addresses to a label. – D-Klotz Oct 17 '12 at 14:22
  • @full.stack.ex I've taken a look at the library (xeger), a surface scan if you will. Is it your understanding that it generates ALL numbers possible from a regular expression? With the qualifier that the library supports a subset of all possible regular expressions. – D-Klotz Oct 17 '12 at 14:25
  • Well, hard to tell without true reverse engineering. They expose one method, generate(), but, under the hood, they really generate a random int and pass it to the underlying mechanism. Looks like that's the index of the variant to generate! So if you take a look at the underlying mechanism or just iterate through the indexes - it may work for you. I mean, you'll need to download the source and play with it. – full.stack.ex Oct 17 '12 at 14:35
  • ... and then contact the developer? – full.stack.ex Oct 17 '12 at 14:39
  • I think the following code illustrates the folly of my request: `public class XegerTest { @Test public void shouldGenerateTextCorrectly() { String regex = "[ab]{4,6}c"; Xeger generator = new Xeger(regex); for (int i = 0; i < 100; i++) { String text = generator.generate(); assertTrue(text.matches(regex)); } } }` It can only generate a subset. – D-Klotz Oct 17 '12 at 14:54
  • Actually, I meant the source code _within_ the generate method. Upon a very cursory look, they generate a list of possible "transitions", pick one _randomly_, and then apply a similar procedure recursively. So the hypothesis is, if you hack their code to iterate over all the transitions instead of picking them randomly and you manage to control the combinatorial explosion by applying some specific heuristics based on your domain assumptions, you may still get it done. Maybe not, but that seems to be the best starting point anyway. – full.stack.ex Oct 17 '12 at 16:12
  • 1
    /Never(, ever)+ give up/ :) Stuff like this helps us grow. So thank you for the question. I've edited this answer with some results. – full.stack.ex Oct 17 '12 at 19:36
0

I'd say the technical answer is no, since you can specify in a regex that a character (digit) occurs zero or more times. That "or more" can mean an arbitrarily large number of digits. In practice you might be able to limit the length of the strings and recursively build out a superset of Strings based on the chars you find in the regex and then text them to create your subset list.

Chris Gerken
  • 16,221
  • 6
  • 44
  • 59
  • The zero or more should not be present within this product, but even so, might and cannot are not the same. I had hoped for an elegant algorithm/formula that solved the problem but it looks like that was a pipe dream. ;) – D-Klotz Oct 17 '12 at 15:09
0

Comment: regular expression is a recursively enumerable language, therefore there exists an algorithm than can enumerate all valid strings.

Even for a more complex language like Java, there can be an algorithm that enumerates all Java programs; no matter how your particular Java program is written, that algorithm will, in finite time, output a Java program that exactly matches yours.

But not all languages are enumerable like this; then they are the languages that we know exist but we cannot speak in. They forever elude any primitive intelligence like ours that's but a turing machine.

irreputable
  • 44,725
  • 9
  • 65
  • 93
0

Brute force, Multithreaded, CPU load 100%:

import java.util.regex.Pattern;

public class AllIP
{
   public static void main( String[] args )
   {
      final Pattern p = Pattern.compile( "34.25.14.(227|228|229|230|243|244|245|246)" );

      int step = 256 / Runtime.getRuntime().availableProcessors();
      for( int range = 0; range < 256; range += step )
      {
         final int from = range;
         final int to   = range + step;
         new Thread(){
            public @Override void run(){
               long atStart = System.currentTimeMillis();
               for( int i = from; i < to ; ++i )
                  for( int j = 1; j < 255; ++j )
                     for( int k = 1; k < 255; ++k )
                        for( int l = 1; l < 255; ++l )
                        {
                           String ip = String.format( "%d.%d.%d.%d", i,j,k,l );
                           if( p.matcher( ip ).matches())
                           {
                              System.out.println( ip );
                           }
                        }
               System.out.println( System.currentTimeMillis() - atStart );
            }
         }.start();
      }
   }
}

Just for fun...

  • 34.25.14.227
  • 34.25.14.228
  • 34.25.14.229
  • 34.25.14.230
  • 34.25.14.243
  • 34.25.14.244
  • 34.25.14.245
  • 34.25.14.246

elapsed time: 01:18:59

Operating System

  • Microsoft Windows XP Professionnel 32-bit SP3

CPU

  • Intel Xeon W3565 @ 3.20GHz 39 °C

  • Bloomfield 45nm Technology

RAM

  • 4,00 Go Dual-Channel DDR3 @ 532MHz (7-7-7-20)

Motherboard

  • LENOVO Lenovo (1366-pin LGA)
Aubin
  • 14,617
  • 9
  • 61
  • 84
  • Thanks for the example. I'm certain the user will be understanding and appreciate the exercise. :) – D-Klotz Oct 17 '12 at 15:04
0

This problem is amazing!

I solved it with a simple JavaCC grammar and 3 classes : Constant, Variable and a Main. The expression I used as input is (34|45).2\d.14.(227|228|229|230|243|244|245|246), note the \d to specify variable part.

Here is the JavaCC grammar:

options
{
    static         = false;
    FORCE_LA_CHECK = true;
    LOOKAHEAD      = 5;
}

PARSER_BEGIN( IPGeneratorFromRegExp )
package hpms.study.jj;

public class IPGeneratorFromRegExp
{
    public final java.util.SortedSet< hpms.study.IPPart > _a = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _b = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _c = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _d = new java.util.TreeSet< hpms.study.IPPart >();

}// class IPGeneratorFromRegExp

PARSER_END( IPGeneratorFromRegExp )

SKIP :
{
  " "
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
    < IPPARTX   :                          (["1"-"9"] | "\\d" ) > 
|   < IPPARTXX  :     (["1"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) > 
|   < IPPART1XX : "1" (["0"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART2XX : "2" (["0"-"4"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART25X : "2" "5"                  (["0"-"5"] | "\\d" ) >
}

void part( java.util.SortedSet< hpms.study.IPPart > parts ):{ Token token = null; }
{
    (
        token=<IPPART25X>
    |   token=<IPPART2XX>
    |   token=<IPPART1XX>
    |   token=<IPPARTXX>
    |   token=<IPPARTX>
    ){
        parts.add(
            token.image.contains( "\\d" )
                ? new hpms.study.Variable( token.image )
                : new hpms.study.Constant( token.image ));
    }
}

void expr( java.util.SortedSet< hpms.study.IPPart > parts ):{}
{
    "(" part( parts ) ( "|" part( parts ))+ ")"
}

void analyze() :{}
{
    ( part( _a ) | expr( _a )) "."
    ( part( _b ) | expr( _b )) "."
    ( part( _c ) | expr( _c )) "."
    ( part( _d ) | expr( _d ))
}

And a fragment of Main.java:

     Reader                source      = new StringReader( _regExp.getText());
     IPGeneratorFromRegExp ipGenerator = new IPGeneratorFromRegExp( source );
     ipGenerator.analyze();

     SortedSet< String > a = new TreeSet<>();
     SortedSet< String > b = new TreeSet<>();
     SortedSet< String > c = new TreeSet<>();
     SortedSet< String > d = new TreeSet<>();
     for( IPPart<?> part : ipGenerator._a ) part.expandTo( a );
     for( IPPart<?> part : ipGenerator._b ) part.expandTo( b );
     for( IPPart<?> part : ipGenerator._c ) part.expandTo( c );
     for( IPPart<?> part : ipGenerator._d ) part.expandTo( d );

     _result.clear();
     for( String ac : a )
     {
        for( String bc : b )
        {
           for( String cc : c )
           {
              for( String dc : d )
              {
                 _result.add( ac + '.' + bc + '.' + cc + '.'  + dc );
              }// for
           }// for
        }// for
     }// for

And a snapshot of the resulting Swing application (very responsive):

A snapshot of the application

The whole Eclipse project may download from lfinance.fr.

Aubin
  • 14,617
  • 9
  • 61
  • 84
  • I spent a few minutes looking here: http://javacc.java.net and confirmed that I have no experience with the Java Compiler Compiler. This is an interesting solution. It looks as if it must be tailored to a context. In this example an IP address syntax. Some of the regular expressions I'm dealing with are SS7 point codes of the form 10.1.* ( reference: http://docs.telcobridges.com/tbwiki/Toolpack:Creating_SS7_Point_Codes_A ). – D-Klotz Oct 17 '12 at 21:20
  • 1
    JavaCC generate the parser and instantiates the generator objects of each part. 10.1.* will be decomposed into . The business knowledge allow us to develop the behavior of the variable part, here a variation between x from y to generate a list of SS7 point codes like 10.0.1, 10.1.2, 10.1.3... The complexity is more into the Variable class than into the JavaCC grammar, I think. – Aubin Oct 17 '12 at 21:30