5

I need to split strings containing basic mathematical expressions, such as:
"(a+b)*c"
or
" (a - c) / d"
The delimiters are + - * / ( ) and space, and I need each of them as an independent token. Basically, the result should look like this:

"("
"a"
"+"
"b"
")"
"*"
"c"

And for the second example:

" "
"("
"a"
" "
"-"
...

I read a lot of questions about similar problems with simpler delimiters, and the common answer was to use zero-width positive lookahead and lookbehind.
Like this: (?<=X | ?=X)
Here X represents the delimiters, but putting them in a character class like this:
[\\Q+-*()\\E/\\s]
does not work as desired.
So how do I have to format the delimiters to make the split work the way I need it?

---Update---
Word-class characters and longer combinations of them should not be split.
Examples: "ab", "c1" or "12".
In short, I need the same result that StringTokenizer would produce, given the parameters "-+*/() " and true.
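For reference, here is a minimal sketch of the StringTokenizer behaviour the update describes. The class and method names (`TokenizerReference`, `tokenize`) are made up for illustration; only the delimiter string and the `true` flag come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named only for this illustration.
class TokenizerReference {
    // Delimiter set from the question; note the trailing space.
    static final String DELIMS = "-+*/() ";

    static List<String> tokenize(String input) {
        // returnDelims = true: every delimiter comes back as its own one-character token
        StringTokenizer st = new StringTokenizer(input, DELIMS, true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(a+b)*c"));    // [(, a, +, b, ), *, c]
        System.out.println(tokenize("a+ab-c1+12")); // [a, +, ab, -, c1, +, 12]
    }
}
```

Any regex-based split can be checked against this output for the same input.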

Thiemo Krause
  • http://stackoverflow.com/questions/2226863/whats-a-good-library-for-parsing-mathematical-expressions-in-java – Zutty May 17 '13 at 13:55
  • How should `a+ab-c1+12` be split? Is `ab` one token, or does it stand for `a*b`, so that the result for this part should be `a` `*` `b`? Are numbers possible in your string? – Pshemo May 17 '13 at 14:03
  • "ab" should stay "ab" as well as "c1" and "12" – Thiemo Krause May 17 '13 at 14:11
  • How about `"a__-c"` (let's say `_` are spaces): should the two spaces inside result in one two-space token `"__"` or in two one-space tokens `"_"` `"_"`? I assume one two-space token, since `12` should stay `12`, but I just want to make sure. – Pshemo May 17 '13 at 14:15
  • it should be two one space tokens – Thiemo Krause May 17 '13 at 14:24

4 Answers

1

It is one thing if you are doing this as student work, but in practice this is more of a job for a lexical analyzer and parser. In C, you would use lex and yacc or GNU flex and bison. In Java, you'd use ANTLR or JavaCC.

But start by writing a BNF grammar for your expected input (usually called the input language).
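To give a rough idea of what the lexing stage does for this input language, here is a hand-written scanner. It is only a sketch, not ANTLR or JavaCC output, and the names `MiniLexer` and `lex` are invented for this example. It assumes the delimiter set from the question and treats every other run of characters as one word token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-rolled lexer; a generated lexer would replace this in practice.
class MiniLexer {
    // Single-character tokens, taken from the question.
    static final String DELIMS = "+-*/() ";

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (DELIMS.indexOf(ch) >= 0) {
                // A delimiter ends the current word (if any) and is a token itself
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                tokens.add(String.valueOf(ch));
            } else {
                word.append(ch); // keep collecting a multi-character word such as "ab" or "12"
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString()); // flush the last word
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("a+(ab-c1  )+12")); // [a, +, (, ab, -, c1,  ,  , ), +, 12]
    }
}
```

A parser built from the BNF grammar would then consume this token list instead of the raw string.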

Eric Jablow
1

Try splitting your data using

yourString.split("(?<=[\\Q+-*()\\E/\\s])|(?=[\\Q+-*()\\E/\\s])(?<!^)");

I assume the problem you had was not in the \\Q+-*()\\E part but in (?<=X | ?=X): it should be (?<=X)|(?=X), so that it produces a separate look-behind and look-ahead.
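A small sketch of the difference, assuming the delimiter class `[-+*/()\\s]` (dash first, so it is literal) and the `(?!^)` guard from the comments; the class and constant names are made up for this example. In the form from the question, the whole of `X | ?=X` is the body of a single look-behind, so it only matches after "X followed by a space" or after "=X".

```java
import java.util.Arrays;

// Hypothetical demo class comparing the two patterns.
class LookaroundSplit {
    // Assumed delimiter class; the dash comes first so it is not a range operator
    static final String D = "[-+*/()\\s]";

    // Form from the question: one look-behind containing the alternatives "X " and " ?=X"
    static final String MALFORMED = "(?<=" + D + " | ?=" + D + ")";

    // Look-behind OR look-ahead; (?!^) prevents a split in front of a leading delimiter
    static final String FIXED = "(?<=" + D + ")|(?=(?!^)" + D + ")";

    public static void main(String[] args) {
        String input = "(a+b)*c";
        System.out.println(Arrays.toString(input.split(MALFORMED))); // [(a+b)*c]
        System.out.println(Arrays.toString(input.split(FIXED)));     // [(, a, +, b, ), *, c]
    }
}
```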


Demo for "_a+(ab-c1__)+12_" (here _ stands for a space; the code below uses real spaces. SO renders two consecutive spaces as one, so __ is used to show them.)

String[] tokens = " a+(ab-c1  )+12 "
        .split("(?<=[\\Q+-*()\\E/\\s])|(?=[\\Q+-*()\\E/\\s])(?<!^)");
for (String token :  tokens)
    System.out.println("\"" + token + "\"");

result

" "
"a"
"+"
"("
"ab"
"-"
"c1"
" "
" "
")"
"+"
"12"
" "
Pshemo
  • In addition to your answer, "(?<=[\\Q+-*()\\E/\\s])|(?=(?!^)[\\Q+-*()\\E/\\s])" is needed, because a leading delimiter such as a bracket would otherwise produce an empty string. – Thiemo Krause May 17 '13 at 15:29
  • @ThiemoKrause True, I updated my answer earlier with `(?=[\\Q+-*()\\E/\\s])(?<!^)` (sorry, forgot to inform you about that) but if you prefer `(?=(?!^)[\\Q+-*()\\E/\\s])` it is also OK. – Pshemo May 17 '13 at 15:43
0

Try this instead:

[-+*/()\\s]

Dashes have to come first or last in a character class in order to not represent a range. The rest of the characters need no escaping (presumably what you were trying to do with \\Q and \\E) because most characters are taken literally anyway in a character class.
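A quick way to see the dash rule in Java is sketched below. The helper `compiles` and the class name are invented for this example; `Pattern.compile` and `PatternSyntaxException` are the standard `java.util.regex` API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class for dash placement inside a character class.
class DashPlacement {
    // Returns true if the regex compiles, false if Java rejects it
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // '+' (43) to '*' (42) is a reversed range, which Java rejects
        System.out.println(compiles("[+-*]"));       // false
        // Dash first or last is taken literally
        System.out.println(compiles("[-+*/()\\s]")); // true
        System.out.println("-".matches("[+*-]"));    // true
        // A dash in the middle can silently form a valid but unintended range:
        // '*' (42) to '/' (47) also covers the comma (44)
        System.out.println(",".matches("[*-/]"));    // true
    }
}
```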

Also, I wasn't aware of the syntax, (?<=X|?=X). If it works, then great. But if it doesn't, try this equivalent expansion, whose syntax I know does work:

(?:(?<=X)|(?=X))
Andrew Cheong
  • I changed the expression to (?:(?<=[-+*/()\\s]) | (?=[-+*/()\\s])) but it does not perform a single split if there are no spaces in the input string, for example: (b+2)*6 – Thiemo Krause May 17 '13 at 14:21
0

You can use the following regex:

\s*(?<=[()+*/a-z-])\s*

(?<=...) is a zero-width look-behind assertion: it checks the character before the current position but does not consume it, so that character is not part of the matched separator. The \s* on either side absorbs the surrounding spaces.

Code example:

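import java.util.Arrays;

// Split after every operator, bracket, or lowercase letter, swallowing adjacent spaces
String a = " (a - c) / d *       x   ";
String regex = "\\s*(?<=[()+*/a-z-])\\s*";
String[] split = a.split(regex);
System.out.println(Arrays.toString(split));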

Output:

[ (, a, -, c, ), /, d, *, x]
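One thing worth checking is how this pattern treats multi-character operands. The sketch below (class and method names are made up for illustration) applies the same regex to an input with "ab" and "c1". Because the look-behind matches after every lowercase letter, a split happens inside such words, and digits are not in the class at all.

```java
import java.util.Arrays;

// Hypothetical check of the regex above against multi-character operands.
class LetterSplitCheck {
    // Same pattern as in the answer
    static final String REGEX = "\\s*(?<=[()+*/a-z-])\\s*";

    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // "ab" is cut after 'a', and "c1" is cut after 'c'
        System.out.println(Arrays.toString(split("ab+c1"))); // [a, b, +, c, 1]
    }
}
```

So this approach fits single-letter operands where the spaces can be dropped, which is not quite the StringTokenizer-like result the question asks for.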
acdcjunior
  • (Please fix the regex at the top also). – nhahtdh May 17 '13 at 14:05
  • @nhahtdh For clarity (and doubt's sake) I usually escape everything, but in this case yours works just as well. With your permission, I updated the answer. Thanks! – acdcjunior May 17 '13 at 14:06
  • I don't know how escaping everything makes it clearer, but I do understand why you do that when you are in doubt. For me, it is harder to keep track of the characters in the character class when most of them are escaped. – nhahtdh May 17 '13 at 14:11
  • @nhahtdh It is clearer when the person who reads it is also unsure of when escaping is needed :) But I totally agree with you. – acdcjunior May 17 '13 at 14:19
  • Sorry, I forgot to mention that longer combinations of word-class characters should stay in one token. Added it. – Thiemo Krause May 17 '13 at 14:30