10

I want to enumerate all the possible values of a finite regular expression in Java for testing purposes.

For some context, I have a regular expression that I'm using to match allowable color values in words. Here's a shortened version of it as an example:

(white|black)|((light|dark) )?(red|green|blue|gray)

I wanted to create a unit test that would enumerate all these values and pass each of them to my utility class which produces a Color object from these, that way if I change the regular expression, my unit tests will fail if an error occurs (i.e. the new color value is unsupported).

I know enumeration is possible, of course (see this question), but is there an existing library for Java which will enumerate all the possible matches for a regex?

Edit: I've implemented a library that does this. See my answer below for links.

Brian
  • 17,079
  • 6
  • 43
  • 66
  • 2
    Do you want *all* of the infinite number of possible matches (with varying amounts of whitespace), or just some finite number of them? – Mark Byers Dec 03 '12 at 17:47
  • 2
    You say `finite regular expression` by which I suppose you mean `regular expression that describes a finite language`. Note that in that case, your regular expressions isn't "finite". It contains a `+`. – Martin Ender Dec 03 '12 at 17:49
  • @m.buettner Ah, that it does. I guess I'll have to remove the plus then. Oh well, I didn't like it anyway. – Brian Dec 03 '12 at 18:17
  • 2
    Reopening this question. The linked duplicate was for _random_ match generation, not generation of all possible matches. – Brian Dec 03 '12 at 22:55
  • Note your library doesn't need to be in Java. You can use any technology to generate your test cases then just simply save them to a `good.txt` file. – djechlin Dec 03 '12 at 23:38
  • @Brian well, I suppose if you run it long enough... – Maarten Bodewes Dec 03 '12 at 23:39
  • @owlstead Ha, well how do I know when it's done? :P – Brian Dec 03 '12 at 23:43
  • For the record, I'm using `matches`, so I don't need the `^$`. – Brian Dec 03 '12 at 23:49
  • @djechlin The problem with using other technologies is then that technology has to be installed on all dev machines. I'd rather have a Java solution which I can add as a dependency to my JUnit tests that can generate it dynamically. Also, putting all of them into a text file isn't useful since the reason I want it is for dynamic unit tests based on the possible values of the regex. Using a text file makes everything defined statically. – Brian Dec 03 '12 at 23:51
  • 3
    you need something like this: http://regldg.com/docs/index.php but I have not seen a java impl – maasg Dec 04 '12 at 00:00
  • @maasg That's exactly what I need. If only it were Java. – Brian Dec 04 '12 at 00:10
  • See also [Generate all valid values for a regular expression](http://stackoverflow.com/questions/15950113/generate-all-valid-values-for-a-regular-expression) and [Using Regex to generate Strings rather than match them](http://stackoverflow.com/questions/22115/using-regex-to-generate-strings-rather-than-match-them) – Sjoerd Apr 06 '17 at 12:29

2 Answers2

3

You are right, didn't find such a tool online as well but you can try Xeger from google

it can create a random matching string from a regexp, and with some code tweaking might do what you want. generation a random match:

String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);

Xeger code is very simple, it consists of 2 files which contain 5 methods between them..
it uses dk.brics.automaton to conver the regex to an automaton, then goes over the automaton transitions making random choices in every node.

the main function is generate:

   private void generate(StringBuilder builder, State state) {
    List<Transition> transitions = state.getSortedTransitions(true);
    if (transitions.size() == 0) {
        assert state.isAccept();
        return;
    }
    int nroptions = state.isAccept() ? transitions.size() : transitions.size() - 1;
    int option = XegerUtils.getRandomInt(0, nroptions, random);
    if (state.isAccept() && option == 0) {          // 0 is considered stop
        return;
    }
    // Moving on to next transition
    Transition transition = transitions.get(option - (state.isAccept() ? 1 : 0));
    appendChoice(builder, transition);
    generate(builder, transition.getDest());
}

you can see that in order to change it so you get all possible matches, you need to iterate over all possible combinations in every possible node (like incrementing a multi digit counter) you will need a hash to prevent loops, but that shouldn't take more than 5 senconds to code..

i would also suggest first checking that the regex is actually finate, by checking that it doesn't have *,+ and other symbols that make this action impossible (just to make this a complete tool for reuse)...

Amitbe
  • 391
  • 1
  • 5
  • +1, this is pretty much what I was expecting I'd have to do. I've been looking into `dk.brics.automaton` and its usage in Xeger. Luckily, `dk.brics.automaton` is BSD-licensed. Perhaps I can write up an open source library for this since it seems like no one else has implemented this (at least not in open source). – Brian Dec 05 '12 at 16:07
  • 1
    On your finite comment, I should be good with just unclosed quantifiers, including greedy, reluctant, and possessive `*`, `+`, and `{n,}`. I'm pretty sure those are the only ones that can result in unbounded repetition. – Brian Dec 05 '12 at 16:10
1

For future browsers coming to this question, I wrote a library that uses dk.brics.automaton using a similar approach to Xeger from the accepted answer and published it. You can find it:

To add it as a dependency:

Maven

<dependency>
    <groupId>com.navigamez</groupId>
    <artifactId>greex</artifactId>
    <version>1.0</version>
</dependency>

Gradle

compile 'com.navigamez:greex:1.0'

Sample Code

Using this question as an example:

GreexGenerator generator = new GreexGenerator("(white|black)|((light|dark) )?(red|green|blue|gray)");
List<String> matches = generator.generateAll();
System.out.println(matches.size()); // "14"
Brian
  • 17,079
  • 6
  • 43
  • 66