2

I'm trying to split a string on ||, &&, and (), and I'm not able to properly split on nested parenthesis.

Example string:

q1 == false || ( q1 == true && q3 != null && ( method(param) - method() ) > 120 )

My current regex / code:

String[] tempTokens = input.split("(?=([|]{2}|[&]{2}|[(]|[)]))|(?<=([|]{2}|[&]{2}|[(]|[)]))");

for (String token : tempTokens) {
    if (token.trim().length() > 0) {
        System.out.println(token.trim());
    }
}

Current output:

q1 == false
||
(
q1 == true
&&
q3 != null
&&
(
method
(
param
)
- method
(
)
)
> 120
)

Wanted output:

q1 == false
||
(
q1 == true
&&
q3 != null
&&
( method(param) - method() ) > 120
)

Basically, I'm trying to tokenize an expression, and I want to split on the parenthesis only if they contain a full statement that contains a >, >=, ==, etc.

tima
  • 1,498
  • 4
  • 20
  • 28
  • 1
    Java regex engine doesn't have the capability to match balanced text (nesting). So, it can't match something like `( method(param) - method() )` –  Jul 31 '17 at 17:51
  • 1
    It would be abuse of regex, and you'd wind up writing a long, potentially buggy expression. Regular expressions are used for *regular grammars*. The code you are showing is context-free. Check out [Regular VS Context-Free Grammars](https://stackoverflow.com/questions/559763/regular-vs-context-free-grammars) and [Parsing If Statement with Regex](https://stackoverflow.com/questions/651455/regular-expression-to-identify-if-statements) – Vince Jul 31 '17 at 18:32
  • 1
    If the regex looks complicated, then you either risk losing it in the next code rewrite, because the next developer couldn't get their head around the logic or you risk creating a time complexity problem (I have seen this happen). This means you should look on how to simplify the regex or find a different approach to tokenising the string. In your case you seem to be dealing with language grammar, so a simple 'finite state machine' may be the approach needed? – Andre M Jul 31 '17 at 19:06
  • I'm already using the Shunting-Yard Algorithm to parse these expressions into a tree, but the thing is I wanted to keep each question and answer (`q1 == false`) together in one node and the only relationships would be `&&` and `||`, so my algorithm only has the `&&` and `||` as operators. So I'm trying see if it is easy to split the string into proper tokens to ignore all other operators. I will try to modify my code to parse all operators instead and see if that works out. – tima Jul 31 '17 at 19:21
  • Your requirement is very specific. A simple regex will not comply with it. I think it's better to define a parser with rules in it. – caisil Jul 31 '17 at 19:49
  • A suggestion: `String[] tempTokens = input.split("[|]{2}|[&]{2}|\\(\\s|\\s\\)");`. :) – caisil Jul 31 '17 at 19:55

4 Answers4

1

Thank you all for the suggestions! It was definitely easier to parse the expression without using regular expressions.

I used a modified Shunting-yard algorithm to convert the expression into postfix notation, and then build a tree from that.

public class ExpressionTreeNode {

    private ExpressionTreeNode left, right;
    private String content;

    public ExpressionTreeNode(ExpressionTreeNode left, ExpressionTreeNode right, String content) {
        this.left = left;
        this.content = content;
        this.right = right;
    }
}

Create the tree method:

private static ExpressionTreeNode createExpressionTree(String[] tokens) {
    final Stack<ExpressionTreeNode> nodes = new Stack<ExpressionTreeNode>();

    for (int i = 0; i < tokens.length; i++) {
        String token = tokens[i];

        if (Operator.isOperator(token)) {
            ExpressionTreeNode rightNode = nodes.pop();
            ExpressionTreeNode leftNode = nodes.pop();
            nodes.push(new ExpressionTreeNode(leftNode, rightNode, token));
        } else {
            nodes.add(new ExpressionTreeNode(null, null, token));
        }
    }

    return nodes.pop();
}

And usage:

// here convert to postfix notation
String[] tempTokens = part.split(" ");
String[] output = infixToPostfix(tempTokens);

ExpressionTreeNode root = createExpressionTree(output);

The trees are pretty huge because of all the operations (not just || and &&) separating into nodes but it will work for me.

Output for the example from the question:

                 /----- 120
         /----- >
         |       |       /----- method()
         |       \----- -
         |               \----- method(param)
 /----- &&
 |       |               /----- null
 |       |       /----- !=
 |       |       |       \----- q3
 |       \----- &&
 |               |       /----- true
 |               \----- ==
 |                       \----- q1
||
 |       /----- false
 \----- ==
         \----- q1
tima
  • 1,498
  • 4
  • 20
  • 28
-1

I don't think you will be able to parse such strings with regex properly. I expect you want to evaluate parts of such string. That's where you may want to look up about AST - Abstract Syntax Tree, parser generators. For С language it's lex + yacc. For java it's antlr or javacc. I have dealt with allo of them.

dimirsen Z
  • 873
  • 13
  • 16
-1

Perhaps using pattern/matcher could get you close

        String txt="( method(param) - method() ) >= 120";

    String re1="(\\()"; // Single Character (
    String re2="(\\s+)";    // White Space 1
    String re3="((?:[a-z][a-z0-9_]*))"; // Variable Name method
    String re4="(\\()"; //  Single Character (
    String re5="((?:[a-z][a-z0-9_]*))"; // Variable Name param
    String re6="(\\))"; // Single Character )
    String re7="( )";   // White Space 
    String re8="(-)";   // Single Character -
    String re9="( )";   // White Space 
    String re10="((?:[a-z][a-z0-9_]*))";    // Variable Name method
    String re11="(\\()";    //  Single Character (
    String re12="(\\))";    //  Single Character )
    String re13="( )";  // White Space 
    String re14="(\\))";    //  Single Character )
    String re15="( )";  // White Space
    String re16="(..?)";    // Any 1 or 2 Characters > >= ==
    String re17="( )";  // White Space 
    String re18="(\\d+)";   // Integer 120 

    Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13+re14+re15+re16+re17+re18,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String c1=m.group(1);
        String ws1=m.group(2);
        String var1=m.group(3);
        String c2=m.group(4);
        String var2=m.group(5);
        String c3=m.group(6);
        String ws2=m.group(7);
        String c4=m.group(8);
        String ws3=m.group(9);
        String var3=m.group(10);
        String c5=m.group(11);
        String c6=m.group(12);
        String ws4=m.group(13);
        String c7=m.group(14);
        String ws5=m.group(15);
        String c8=m.group(16);
        String ws6=m.group(17);
        String int1=m.group(18);
        System.out.println(c1.toString() + ws1.toString() + var1.toString() + c2.toString() + var2.toString() +c3.toString()+ws2.toString()+c4.toString()+ws3.toString()+var3.toString()+c5.toString()+c6.toString()+ws4.toString()+c7.toString()+ws5.toString()+c8.toString()+ws6.toString()+int1.toString()+"\n");
    }
Mike
  • 14
  • 2
-1

For your very particular solution, you can use this: (Edit, better coded)

    var s = "q1 == false || ( q1 == true && q3 != null && ( method(param) - method() ) > 120 )";
    var result = new List<string>();
    Parse(result, s);


    void Parse(List<string> result, string s)
    {
        var currentSplitter = string.Empty;
        var listOfSplitters = new string[] { "||", "&&" };
        var splitterPosition = -1;

        for (var i = 0; i < listOfSplitters.Length; i++)
        {
            currentSplitter = listOfSplitters[i];
            splitterPosition = s.IndexOf(currentSplitter);
            if (splitterPosition > -1)
            {
                break;
            }
        }

        if (splitterPosition == -1) //not found
        {
            result.Add(s);
            return;
        }

        var parenthesisPosition = s.IndexOf("(");
        if (parenthesisPosition == -1) //no more parenthesis
        {
            result.Add(s);
        }
        else if(parenthesisPosition > splitterPosition) //there is a parenthesis, but it's after a splitter
        {
            result.Add(s.Substring(0,splitterPosition-1));
            result.Add(currentSplitter);
            Parse(result, s.Substring(splitterPosition + currentSplitter.Length, s.Length - splitterPosition - currentSplitter.Length));
        }
        else //there is a parenthesis
        {
            result.Add("(");
            var lastpar = s.LastIndexOf(')');
            var sub = s.Substring(parenthesisPosition+1, lastpar -2);
            Parse(result, sub);
            result.Add(")");
        }
    }
Juan
  • 41
  • 5