Using Java's regular expression to identify tree node's children in parenthesized string.

Question

I am working on converting a parenthesized string such as f(d(a c(b))e) into a Tree data structure in Java (I am working on a method which would allow one to instantiate a Tree using the string representation). In the above string, f is the tree's root node which branches off into a subtree at d and a leaf-node at e. After I was able to identify f as the current node's label, I am left with d(a c(b))e.

I would like to be able to use Java's regular expressions to identify the children; in this case, d(a c(b)) and e. So, the requirements are as follows.

In the string, a single character may or may not be followed by parentheses. If it is, the match should include everything up to the matching closing parenthesis, even if there are nested parentheses in between. So, the regular expression would match d(a c(b)) or e.

Moreover, I want this to work on more than just nodes with two children. A possible parenthesized string might be f(a b c), which is a tree rooted at f with 3 leaves.

So far, I have .\(?[^\(\)]\)? but this doesn't work.

Just start creating a parser and forget about regex. It may become so complicated that you won't even follow it. — HamZa, Nov 03 '13 at 22:45
It's not possible with regular expressions, see http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns Use StreamTokenizer and recursion http://docs.oracle.com/javase/7/docs/api/java/io/StreamTokenizer.html — Stefan Haustein, Nov 03 '13 at 23:02

score 4 · Accepted Answer · edited May 23 '17 at 12:34

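It's not possible with regular expressions; see "Can regular expressions be used to match nested patterns?" (http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns).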
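
To illustrate why the pattern from the question falls short, here is a small sketch that runs it over d(a c(b))e with java.util.regex. The class name RegexAttempt and the helper findAll are made up for this illustration. The pattern allows exactly one non-parenthesis character between the optional parentheses, so it has no way to keep track of nesting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class RegexAttempt {

  // Collects every match of the question's pattern in the given input.
  static List<String> findAll(String input) {
    // The pattern from the question, escaped for a Java string literal.
    Pattern pattern = Pattern.compile(".\\(?[^\\(\\)]\\)?");
    Matcher matcher = pattern.matcher(input);
    List<String> matches = new ArrayList<>();
    while (matcher.find()) {
      matches.add(matcher.group());
    }
    return matches;
  }

  public static void main(String[] args) {
    // Print each fragment in quotes so leading spaces stay visible.
    for (String match : findAll("d(a c(b))e")) {
      System.out.println("\"" + match + "\"");
    }
  }
}
```

The matches come out as the four fragments "d(a", " c", "(b)" and ")e" rather than the two children d(a c(b)) and e.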

Use StreamTokenizer and recursion instead. It should look similar to this (untested):

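import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;

public class Node {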
  private String name;
  private ArrayList<Node> children = new ArrayList<Node>();

  public static Node parseTree(String s) throws IOException {
    StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
    tokenizer.nextToken();                 // Move to first token
    Node result = new Node(tokenizer);     // Parse root node (and children)
    if (tokenizer.ttype != StreamTokenizer.TT_EOF) {
      throw new RuntimeException("Leftover token: "+ tokenizer.ttype);
    }
    return result;
  }

  Node(StreamTokenizer tokenizer) throws IOException {
    if (tokenizer.ttype != StreamTokenizer.TT_WORD) {
      throw new RuntimeException("identifier expected; got: " + tokenizer.ttype);
    }
    name = tokenizer.sval;                  // Read the name
    if (tokenizer.nextToken() == '(') {     // Consume the name and check for children
      tokenizer.nextToken();                // Yes, consume '('
      do {
        children.add(new Node(tokenizer));  // Add and parse a child
      } while (tokenizer.ttype != ')');     // Until we reach ')'
      tokenizer.nextToken();                // Consume ')'
    }
  }
}

(It is possible to write slightly simpler recursive parsing code without StreamTokenizer for this if the node names are all a single character and the separator is always just a single space)
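
A minimal sketch of what such a hand-written parser might look like, assuming every label is a single letter or digit and children are separated by optional spaces (the question's own example omits the space before e). The class name SimpleNode, its fields, and its methods are invented for this illustration and are not part of the code above. Unlike the StreamTokenizer version, this sketch also accepts empty parentheses such as f().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a recursive-descent parser over a character index.
public class SimpleNode {
  final char label;
  final List<SimpleNode> children = new ArrayList<>();

  private SimpleNode(char label) {
    this.label = label;
  }

  public static SimpleNode parse(String s) {
    int[] pos = {0};                        // Shared cursor into the string
    SimpleNode root = parseNode(s, pos);
    if (pos[0] != s.length()) {
      throw new IllegalArgumentException("Leftover input at index " + pos[0]);
    }
    return root;
  }

  private static SimpleNode parseNode(String s, int[] pos) {
    if (pos[0] >= s.length() || !Character.isLetterOrDigit(s.charAt(pos[0]))) {
      throw new IllegalArgumentException("Label expected at index " + pos[0]);
    }
    SimpleNode node = new SimpleNode(s.charAt(pos[0]++));  // Read the label
    if (pos[0] < s.length() && s.charAt(pos[0]) == '(') {
      pos[0]++;                             // Consume '('
      while (true) {
        while (pos[0] < s.length() && s.charAt(pos[0]) == ' ') {
          pos[0]++;                         // Skip separators
        }
        if (pos[0] >= s.length()) {
          throw new IllegalArgumentException("Missing ')'");
        }
        if (s.charAt(pos[0]) == ')') {
          break;                            // End of this child list
        }
        node.children.add(parseNode(s, pos));  // Parse one child recursively
      }
      pos[0]++;                             // Consume ')'
    }
    return node;
  }

  // Prints the tree back in parenthesized form, one space between siblings.
  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder().append(label);
    if (!children.isEmpty()) {
      sb.append('(');
      for (int i = 0; i < children.size(); i++) {
        if (i > 0) {
          sb.append(' ');
        }
        sb.append(children.get(i));
      }
      sb.append(')');
    }
    return sb.toString();
  }
}
```

For example, SimpleNode.parse("f(d(a c(b))e)") gives a root labelled f with the two children d and e, and SimpleNode.parse("f(a b c)") gives a root with three leaves.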

Works like a charm. Two notes: it should be tokenizer.nextToken() and parseTree() method should not be void. — under_the_sea_salad, Nov 04 '13 at 21:39
I have a rather odd question: How did you know how to do this? Is this something you've encountered before or did you just know what the StreamTokenizer object was capable of doing so well that it did not take you long to come up with this solution? — under_the_sea_salad, Nov 06 '13 at 04:06
I have used StreamTokenizer before to implement parsers, so I was familiar with its capabilities — Stefan Haustein, Nov 08 '13 at 00:28
