6

No question on SO addresses my particular problem. I know very little about regular expression. I am building an expression parser in Java using Regex Class for that purpose. I want to extract Operands, Arguments, Operators, Symbols and Function Names from expression and then save to ArrayList. Currently I am using this logic

String string = "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)" //This is just for testing purpose later on it will be provided by user
List<String> res = new ArrayList<>();
Pattern pattern = Pattern.compile((\\Q^\\E|\\Q/\\E|\\Q-\\E|\\Q-\\E|\\Q+\\E|\\Q*\\E|\\Q)\\E|\\Q)\\E|\\Q(\\E|\\Q(\\E|\\Q%\\E|\\Q!\\E)) //This string was build in a function where operator names were provided. Its mean that user can add custom operators and custom functions 
Matcher m = pattern.matcher(string);
int pos = 0;
while (m.find()) 
{
    if (pos != m.start()) 
    {
        res.add(string.substring(pos, m.start()))
    }
    res.add(m.group())
    pos = m.end();
}
if (pos != string.length()) 
{
     addToTokens(res, string.substring(pos));
}
for(String s : res)
{
     System.out.println(s);
}

Output:

2
!
+
atan2
(
3
+
9
,
2
+
3
)
-
2
*
PI
+
3
/
3
-
9
-
12
%
3
*
sin
(
9
-
9
)
+
(
2
+
6
/
2
)

Problem is that now Expression can contain Matrix with user defined format. I want to treat every Matrix as a Operand or Argument in case of functions.

Input 1:

String input_1 = "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6"

Output Should be:

2
+
3
-
9
*
[{2+3,2,6},{7,2+3,2+3i}]
+
9
*
6

Input 2:

String input_2 = "{[2,5][9/8,func(2+3)]}+9*8/5"

Output Should be:

{[2,5][9/8,func(2+3)]}
+
9
*
8
/
5

Input 3:

String input_3 = "<[2,9,2.36][2,3,2!]>*<[2,3,9][23+9*8/8,2,3]>"

Output Should be:

<[2,9,2.36][2,3,2!]>
*
<[2,3,9][23+9*8/8,2,3]>

I want that now ArrayList should contain every Operand, Operators, Arguments, Functions and symbols at each index. How can I achieve my desired output using Regular expression. Expression validation is not required.

Kamil Mahmood
  • 527
  • 9
  • 22

2 Answers2

3

I think you can try with something like:

(?<matrix>(?:\[[^\]]+\])|(?:<[^>]+>)|(?:\{[^\}]+\}))|(?<function>\w+(?=\())|(\d+[eE][-+]\d+)|(?<operand>\w+)|(?<operator>[-+\/*%])|(?<symbol>.)

DEMO

elements are captured in named capturing groups. If you don't need it, you can use short:

\[[^\]]+\]|<[^>]+>|\{[^\}]+\}|\d+[eE][-+]\d+|\w+(?=\()|\w+|[-+\/*%]|.


The \[[^\]]+\]|<[^>]+>|\{[^\}]+\} match opening bracket ({, [ or <), non clasing bracket characters, and closing bracket (},],>) so if there are no nested same-type brackets, there is no problem. Implementatin in Java:

public class Test {
    public static void main(String[] args) {
        String[] expressions = {"2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)", "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6",
        "{[2,5][9/8,func(2+3)]}+9*8/5","<[2,9,2.36][2,3,2!]>*<[2,3,9][23 + 9 * 8 / 8, 2, 3]>"};
        Pattern pattern = Pattern.compile("(?<matrix>(?:\\[[^]]+])|(?:<[^>]+>)|(?:\\{[^}]+}))|(?<function>\\w+(?=\\())|(?<operand>\\w+)|(?<operator>[-+/*%])|(?<symbol>.)");
        for(String expression : expressions) {
            List<String> elements = new ArrayList<String>();
            Matcher matcher = pattern.matcher(expression);
            while (matcher.find()) {
                elements.add(matcher.group());
            }
            for (String element : elements) {
                System.out.println(element);
            }
            System.out.println("\n\n\n");
        }
    }
}

Explanation of alternatives:

  • \[[^\]]+\]|<[^>]+>|\{[^\}]+\} - match opening bracket of given type, character which are not closing bracket of that type (everything byt not closing bracket), and closing bracket of that type,
  • \d+[eE][-+]\d+ = digit, followed by e or E, followed by operator + or -, followed by digits, to capture elements like 2e+3
  • \w+(?=\() - match one or more word characters (A-Za-z0-9_) if it is followed by ( for matching functions like sin,
  • \w+ - match one or more word characters (A-Za-z0-9_) for matching operands,
  • [-+\/*%] - match one character from character class, to match operators
  • . - match any other character, to match other symbols

Order of alternatives is quite important, as last alternative . will match any character, so it need to be last option. Similar case with \w+(?=\() and \w+, the second one will match everything like previous one, however if you don't wont to distinguish between functions and operands, the \w+ will be enough for all of them.

In longer exemple the part (?<name> ... ) in every alternative, is a named capturing group, and you can see in demo, how it group matched fragments in gorups like: operand, operator, function, etc.

m.cekiera
  • 5,365
  • 5
  • 21
  • 35
2

With regular expressions you cannot match any level of nested balanced parentheses.

For example, in your second example {[2,5][9/8,func(2+3)]} you need to match the opening brace with the close brace, but you need to keep track of how many opening and closing inner braces/parens/etc there are. That cannot be done with regular expressions.

If, on the other hand, you simplify your problem to remove any requirement for balancing, then you probably can handle with regular expressions.

Jonah Graham
  • 7,890
  • 23
  • 55
  • Is it possible that I say word start from this and end at this and match anything in between? – Kamil Mahmood Oct 11 '15 at 21:17
  • Not sure I follow. If you want to match anything between the first opening paren and the last closing paren you can do something like this (untested): `\((.*)\)` the inner parens will be what you want. – Jonah Graham Oct 11 '15 at 21:20