I'm looking to split comma-delimited strings into a series of search terms. However, in doing so I'd like to ignore commas within parentheses. For example, I'd like to be able to split the string

a, b, c, search:(1, 2, 3), d

into

[[a] [b] [c] [search:(1, 2, 3)] [d]]

Does anyone know how to do this using regular expressions in Java?

Thanks!

Jack
  • It can quickly get tricky: would *"a, (, c, ), search:(1, 2, 3), d"* be a valid input, for example? – SyntaxT3rr0r Jul 19 '10 at 22:23
  • I check the content of the search after splitting it into its constituent terms. I err on the side of rejecting things, so I'd like the above string to be split into [a] [(, c, )] [search:(1, 2, 3)] [d] . Then I'd just notice elsewhere that (, c, ) isn't a valid term and reject the overall search. – Jack Jul 21 '10 at 19:46

2 Answers

This problem has another solution that wasn't mentioned, so I'll post it here for completeness. The situation is similar to the question "regex-match a pattern, excluding..."

We can solve this with a beautifully-simple regex:

\([^)]*\)|(\s*,\s*)

The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas and surrounding spaces to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We will replace these commas with something distinctive, then split.

This program shows how to use the regex; it prints each term on its own line:

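import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "a, b, c, search:(1, 2, 3), d";
        // Left alternative: a parenthesized group (matched, not captured).
        // Right alternative: a comma with optional surrounding whitespace (Group 1).
        Pattern regex = Pattern.compile("\\([^)]*\\)|(\\s*,\\s*)");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A real separator: swap in the marker.
                m.appendReplacement(b, "SplitHere");
            } else {
                // A parenthesized group: copy it back unchanged. quoteReplacement
                // keeps any '$' or '\' in the match from being read as a group reference.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        // "SplitHere" is assumed not to occur in the input itself.
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program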
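
As a variation on the program above, here is a minimal sketch that skips the marker string and slices the input at each Group 1 match directly, so nothing depends on "SplitHere" being absent from the input. The class name `DirectSplit` and the method `split` are made up for illustration; the pattern is the same one used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DirectSplit {
    // Same pattern as above: parenthesized groups are matched but not captured;
    // only separators outside parentheses land in Group 1.
    private static final Pattern SEPARATOR =
            Pattern.compile("\\([^)]*\\)|(\\s*,\\s*)");

    // Hypothetical helper: returns the terms between the Group 1 separators.
    static List<String> split(String subject) {
        List<String> terms = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(subject);
        int start = 0; // index where the current term begins
        while (m.find()) {
            if (m.group(1) != null) {
                // A real separator: cut the term that ends here.
                terms.add(subject.substring(start, m.start(1)));
                start = m.end(1);
            }
            // Otherwise the match was a parenthesized group: leave it in the term.
        }
        terms.add(subject.substring(start)); // trailing term
        return terms;
    }

    public static void main(String[] args) {
        for (String term : split("a, b, c, search:(1, 2, 3), d")) {
            System.out.println(term);
        }
    }
}
```

With the input from the comments on the question, "a, (, c, ), search:(1, 2, 3), d", this sketch keeps "(, c, )" together as one term, which can then be rejected by later validation.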

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

zx81
  • For my use case I used "\\([^)]*\\)|(.*,.*)" because there wasn't always whitespace around the , – David Thielen Mar 06 '20 at 13:45
  • Thanks for sharing this tip on using the `|` to specify a second capture group. I was able to use this method to solve a situation where I needed to find commas that were not in parentheses (even nested ones). I posted a question and answer [here](https://stackoverflow.com/q/62806431/1898524). – Ben Jul 09 '20 at 02:35

This isn't a full regex, but it'll get you there:

(\([^)]*\)|\S)*

This uses a common trick: treating one long string of characters as if it were a single character. On the right side of the alternation we match a non-whitespace character with \S. On the left side we match an opening parenthesis, any run of characters other than ), and the closing parenthesis.

The end result is that a balanced set of parentheses is treated as if it were a single character, and so the regex as a whole matches a single word, where a word can contain these parenthesized groups.
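
To make the idea concrete, here is a rough Java sketch that pulls the terms out with `Matcher.find` instead of splitting. It is an adaptation, not the exact regex above: it excludes commas as well as whitespace and uses `+` instead of `*` so that empty matches are not returned. It also assumes a term never contains whitespace outside parentheses. The names `TokenMatch` and `terms` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatch {
    // One "character" of a term is either a whole parenthesized group
    // or a single character that is neither a comma nor whitespace.
    private static final Pattern TERM =
            Pattern.compile("(?:\\([^)]*\\)|[^,\\s])+");

    // Hypothetical helper: collects every match of TERM, in order.
    static List<String> terms(String subject) {
        List<String> out = new ArrayList<>();
        Matcher m = TERM.matcher(subject);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String term : terms("a, b, c, search:(1, 2, 3), d")) {
            System.out.println(term);
        }
    }
}
```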

(Note that this regular expression can't handle nested parentheses: [^)]* stops at the first closing parenthesis, so one level of parentheses is the limit.)
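
If nested parentheses do need to be supported, one option is to leave regular expressions aside and scan the string while counting parenthesis depth. The following is only a sketch of that alternative, not part of the regex approach above; the names `DepthSplit` and `split` are made up for illustration, and it assumes commas are the separators and that surrounding whitespace should be trimmed.

```java
import java.util.ArrayList;
import java.util.List;

class DepthSplit {
    // Hypothetical helper: splits on commas that sit outside all parentheses.
    static List<String> split(String subject) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < subject.length(); i++) {
            char ch = subject.charAt(i);
            if (ch == '(') {
                depth++;
            } else if (ch == ')' && depth > 0) {
                depth--;
            }
            if (ch == ',' && depth == 0) {
                // Top-level comma: finish the current term.
                terms.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        terms.add(current.toString().trim()); // trailing term
        return terms;
    }

    public static void main(String[] args) {
        for (String term : split("a, f(1, g(2, 3)), d")) {
            System.out.println(term);
        }
    }
}
```

An unmatched opening parenthesis leaves the depth above zero, so everything after it ends up in a single term; that term can then be rejected by whatever validation runs afterwards.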

John Kugelman
  • +1, but since he wants neither commas nor zero-width matches, this would be closer: `(?:\([^)]*\)|[^,\s])` ([demo](http://regex101.com/r/yJ0jB2)) :) – zx81 Jun 16 '14 at 09:45