8

I need to validate a user given String and validate that it is a valid Set, possibly a set that contains inner sets. Examples:

 1) {1, 2, 3, 4} = valid
 2) {1, 2, {3, 4}, 5} = valid
 3) 1, 2, 3, 4 = invalid (missing brackets)
 4) {1, 2, {3, 4, 5} = invalid (missing inner bracket)

This is the regex I am using (broken up for readability):

String elementSeparator = "(,\\s)?";
String validElement = "(\\{?[A-Za-z0-9]*\\}?" + elementSeparator + ")*";
String regex = "^\\{" + validElement + "\\}$";

Currently the it accepts Sets with optional opening and closing brackets, but I need it to only accept if they are both there, and not if an inner set is missing a bracket. In my current implementation the 4th example is accepted as a valid set.

How can I accomplish this?

Filip Roséen - refp
  • 62,493
  • 20
  • 150
  • 196
unmuse
  • 677
  • 2
  • 9
  • 21
  • 4
    Regex can't parse nested brackets [to an infinite level]. – Bergi Oct 31 '12 at 18:43
  • 3
    At least Java's regex flavor can't. This would be possible with PCRE though. – Martin Ender Oct 31 '12 at 18:43
  • It cannot be done. See [this thread](http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns), for example. – Ted Hopp Oct 31 '12 at 18:44
  • http://stackoverflow.com/questions/2415872/is-it-possible-matching-the-exact-same-number-of-opening-closing-braces – assylias Oct 31 '12 at 18:44
  • 5
    Using a single regex to validate a recursive grammar (such as this one) is generally stretching the power of regexes to the breaking point. Write a simple state machine with a nesting counter instead. – Jim Garrison Oct 31 '12 at 18:45
  • +1 for Jim Garrison, write a state machine instead. This one should actually be pretty straightforward. – FloppyDisk Oct 31 '12 at 18:47
  • 2
    @JimGarrison - You need something a little more powerful than a state machine, since state machines == finite state automata == regular grammar == regex (in power). Note that just adding a counter (or a recursion stack) to keep track of parenthesis depth takes things out of the realm of state machines. – Ted Hopp Oct 31 '12 at 18:51
  • 1
    Only if your definition of "state machine" is purist and minimal "-) – Jim Garrison Oct 31 '12 at 18:59
  • It does not need to be recursive, I just need to pass the given test cases (this is for a school project). The problem is how do I validate that an inner set has opening and closing curly braces. – unmuse Oct 31 '12 at 19:00
  • @JimGarrison Except you can't have a counter in a finite state machine – NullUserException Oct 31 '12 at 19:07
  • @NullUserException Why not? Basic formal FSMs can be extended in many ways, and adding a stack or a counter are just two possible extensions. – Jim Garrison Oct 31 '12 at 23:45
  • @JimGarrison Then it wouldn't be an FSM, it would be something else entirely. If we deviate from formal definitions, what's the point of having them? – NullUserException Oct 31 '12 at 23:52

3 Answers3

4

Here's some Java pseudo-code for how to approach this problem without using any heavyweight tools such as ANTLR. The basic approach is to split the input into tokens consisting of

  1. A single open or close brace
  2. A comma
  3. Whitespace
  4. An identifier

Then you scan through the tokens, keeping track of the nesting level as you go. If when you get to the end the nesting level isn't zero, the input string has an unbalanced brace.

Pattern token = Pattern.compile("([{}]|,|[A-Aa-z0-9]+|\s+)");
int nesting = 0
Matcher m = token.matcher(inputString);
while(m.find())
{
    if (m.group(1).equals("{")
        nesting++;
    else if (m.group(1).equals("}")
    {
        nesting--;
        if (nesting < 0)
            error - too many right braces
    }
    else
        ....
}
if (nesting != 0) 
    log("incorrect nesting");

Once you have this framework in place you can enhance it to detect things like two commas in a row: set a flag when you see a comma, clear the flag when you see an identifier (but not whitespace). In the branch for comma and close brace you test the flag and issue an error message since a comma at that point is not valid. And so on, for whatever validation you need.

Note that my pseudocode above is not a complete solution, just intended to give you the general approach. A complete solution would be somewhat more involved as it would have to deal with invalid characters, making the lexer (the part that breaks up the string into tokens) more complex.

Jim Garrison
  • 85,615
  • 20
  • 155
  • 190
3

Because of your use of matching brackets, a simple regex grammar won't be enough. You'll need to look into what are called Context Free Grammars. I recommend looking into ANTLR, but it will be a much more heavy weight solution than you thought you would need.

Joe
  • 366
  • 1
  • 6
0

An easy way would be to search for the last '{', then the '}' that immediately follows. Then verify that the text in between is valid (should be a comma separated list). Then replace the entire string (from '{' to '}' with a dummy value, e.g. 0. Then repeat until you are left with 0, or you encounter an error.

Darren
  • 722
  • 4
  • 12