2

I'm trying to split strings in a rather specific way. I've been experimenting with the .split() and .replaceAll() methods, but I can't get it right.

Here are a couple of examples of the strings I need to split, each followed by the result the split must produce. A comma separates the strings in the resulting array.

Example 1: "(and (or (can-hit-robot) (wall) ) (can-hit-robot) (wall) ) )"

"(and", "(or", "(can-hit-robot)", "(wall)", ")", "(can-hit-robot)", "(wall)", ")"

Example 2: "(seq (shoot) (if (can-hit-robot) (shoot) (move) ) )"

"(seq", "(shoot)", "(if", "(can-hit-robot)", "(shoot)", "(move)", ")", ")"

Example 3: "(while(wall)(if (can-hit-robot)(shoot)(move)))"

"(while", "(wall)", "(if", "(can-hit-robot)", "(shoot)", "(move)", ")", ")"

Any help would be hugely appreciated!

Tim Pietzcker
Sven
  • I'm *really* struggling to derive your splitting algorithm from your examples, and I've been reverse engineering stuff for nearly 10 years. Admittedly, it's a Saturday morning and my hangover is still raging, but it'd be nice to get a bit of explanation. – Polynomial May 12 '12 at 10:36
  • Sorry for that. Basically, I'm writing a kind of translator/compiler that converts a string into usable strings, which in turn get turned into commands. The parentheses are used to delimit a "block", a little bit like { } in Java. – Sven May 12 '12 at 11:49

4 Answers

1

How's this?

(?:\s*(?=\())|(?:(?<=\))\s*)

It relies on lookbehind though, so engines without lookbehind might not be able to handle this expression. :(

The rule being expressed is, split just before an opening parenthesis and just after a closing parenthesis, also cutting off any spaces on the outside of the parenthesis. The left part of the alternation thus matches spaces leading up to an opening paren; the right part will match spaces continuing after a closing paren.
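
As a rough sketch of how this expression might be used from Java (the language the asker mentions in the comments): the backslashes are doubled because the pattern sits inside a Java string literal, and empty pieces are dropped because `String.split` can return an empty string when a zero-width match directly follows another match. The names `ParenSplitter` and `splitBlocks` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ParenSplitter {
    // The expression from this answer; every backslash is doubled for the Java string literal.
    private static final String PATTERN = "(?:\\s*(?=\\())|(?:(?<=\\))\\s*)";

    // Hypothetical helper: splits the input and discards any empty pieces.
    static List<String> splitBlocks(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split(PATTERN)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example 2 from the question.
        System.out.println(splitBlocks("(seq (shoot) (if (can-hit-robot) (shoot) (move) ) )"));
    }
}
```

For Example 2 and Example 3 from the question, this should produce the eight strings listed there.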

Amadan
  • Basically the same comment as above. Thank you. Sorry to bother you further, but how do I enter this? I get errors (working in Eclipse). My "standard" way of using split goes like this: `String pattern = "(?:\s*(?=\())|(?:(?<=\))\s*)"; return path.split(pattern);` But using your pattern like this causes errors. What am I doing wrong? – Sven May 12 '12 at 12:03
  • What @TimPietzcker said. Also, in the future, any time you have a question on a regexp, be sure to also add a tag for the programming language you're using, since regular expressions are not quite compatible across implementations. – Amadan May 12 '12 at 12:17
  • Will try to remember that. I immediately assumed this was in Java, since I always search this website when I have a Java problem. Sorry for that! – Sven May 12 '12 at 12:21
1

Without lookbehind assertions: You can split on

\s*(?=\(|\B\))

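This splits before an opening or closing parenthesis, consuming any whitespace directly in front of it. A closing parenthesis only counts when it is not at a word boundary, that is, when it does not directly follow a word character, so the closing parenthesis of (wall) stays attached to its word.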

Input: (and (or (can-hit-robot) (wall) ) (can-hit-robot) (wall) ) )

Output:

(and 
(or 
(can-hit-robot) 
(wall) 
) 
(can-hit-robot) 
(wall) 
) 
)

Input: (while(wall)(if (can-hit-robot)(shoot)(move)))

Output:

(while
(wall)
(if 
(can-hit-robot)
(shoot)
(move)
)
)
Tim Pietzcker
  • Thank you. Sorry to bother you further, but how do I enter this? I get errors (working in Eclipse). My "standard" way of using split goes like this: `String pattern = "\s*(?=\(|\B\))"; return path.split(pattern);` But using your pattern like this causes errors. What am I doing wrong? – Sven May 12 '12 at 12:02
  • @Sven: In a Java string, you need to double the backslashes. – Tim Pietzcker May 12 '12 at 12:15
  • Thank you, just discovered that too. The resulting String array contains an empty string "" between each pair of words. Is that due to a faulty implementation on my end, or is it supposed to happen? If it is supposed to happen, is there a way to prevent it? – Sven May 12 '12 at 12:16
  • @Sven: I don't see this behaviour in PowerGREP, but it possibly discards empty splits automatically. The behaviour you're seeing actually makes sense because the whitespace token `\s` is optional, therefore the regex matches once with the space and once without. I don't think you can avoid this - another reason why regex is probably not the tool of choice for this. – Tim Pietzcker May 12 '12 at 14:43
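
The two points raised in these comments (doubled backslashes and stray empty strings) can be sketched in Java as follows. This is only one possible workaround, not something the answer itself proposes: it keeps the lookahead-only pattern and filters the empty strings out after splitting. `LookaheadSplit`, `rawSplit` and `withoutEmpties` are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LookaheadSplit {
    // The lookahead-only pattern, with each backslash doubled for the Java string literal.
    static final String PATTERN = "\\s*(?=\\(|\\B\\))";

    // Plain String.split; the result may contain empty strings between tokens.
    static String[] rawSplit(String input) {
        return input.split(PATTERN);
    }

    // Assumed workaround: remove the empty strings after splitting.
    static List<String> withoutEmpties(String input) {
        return Arrays.stream(rawSplit(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "(or (wall) )";
        System.out.println(Arrays.toString(rawSplit(input))); // includes empty strings
        System.out.println(withoutEmpties(input));            // empty strings removed
    }
}
```

Filtering afterwards leaves the pattern untouched, at the cost of one extra pass over the array.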
0

You obviously have a grammar there. Don't go parsing it with regex, use a real parser.

Recommendations:

Or perhaps you should start by reading something about Parsing in the first place.

Otherwise, Cthulhu is calling

Community
Sean Patrick Floyd
0

Not really what you're asking for, but I think you'll be better off writing a proper parser. I think you then want to evaluate this expression somehow? You could then parse the input into a tree, which will make your evaluation much easier.

Taking the first example, (and (or (can-hit-robot) (wall) ) (can-hit-robot) (wall) ) ), a recursive descent parser would read the and, then find a new subexpression ((or (can-hit-robot) (wall) ) (can-hit-robot) (wall) )), begin a new child of and ((or (can-hit-robot) (wall) )), and so on.
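
A minimal sketch of such a recursive descent parser in Java, under the assumption that every block has the shape `( name child-blocks... )` and that whitespace is optional, as in the question's examples. The `Node` and `BlockParser` classes and all their members are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// One block of the input: a name plus any nested blocks.
class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) { this.name = name; }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("(").append(name);
        for (Node child : children) {
            sb.append(' ').append(child);
        }
        return sb.append(')').toString();
    }
}

class BlockParser {
    private final String src;
    private int pos = 0;

    private BlockParser(String src) { this.src = src; }

    // Parses exactly one top-level block; anything left over is an error.
    static Node parse(String text) {
        BlockParser parser = new BlockParser(text);
        Node root = parser.parseBlock();
        parser.skipSpaces();
        if (parser.pos != text.length()) {
            throw new IllegalArgumentException("Unexpected input at index " + parser.pos);
        }
        return root;
    }

    // Assumed grammar: block := '(' name block* ')'
    private Node parseBlock() {
        skipSpaces();
        expect('(');
        skipSpaces();
        int start = pos;
        while (pos < src.length() && isNameChar(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Missing name at index " + pos);
        }
        Node node = new Node(src.substring(start, pos));
        skipSpaces();
        // Each '(' seen here starts a child block: recurse.
        while (pos < src.length() && src.charAt(pos) == '(') {
            node.children.add(parseBlock());
            skipSpaces();
        }
        expect(')');
        return node;
    }

    private static boolean isNameChar(char c) {
        return c != '(' && c != ')' && !Character.isWhitespace(c);
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at index " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }
}
```

Calling `BlockParser.parse("(while(wall)(if (can-hit-robot)(shoot)(move)))")` should give a `while` node with two children, `wall` and `if`. A strict parser like this one rejects the first example as written, because that string contains one more closing parenthesis than opening ones.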

carlpett