-3

When splitting a string, how can I make sure that a delimiter located between two given characters (here, inside square brackets) is not treated as a split point?

// Input
String string = "a,b,[c,d],e";
String[] split = string.split(",");
// Output
split[0] // "a"
split[1] // "b"
split[2] // "[c"
split[3] // "d]"
split[4] // "e"
// Required
split[0] // "a"
split[1] // "b"
split[2] // "[c,d]"
split[3] // "e"
spongebob
  • What did you attempt that didn't work? – Reimeus Aug 13 '14 at 18:03
  • Yep need an actual example of what you want to accomplish, and what failed. A shot in the dark though: \b is a word boundary, it might help you. Also \s is for any space (space, tab...) character, might help you too. – Xælias Aug 13 '14 at 18:05
  • It is difficult to tell without an example, but it may be that you will need to write a simple finite state machine parser if your requirements are sufficiently idiosyncratic. – David Conrad Aug 13 '14 at 18:06
  • Can brackets ever be nested? `a,b,[c,d,[e,f],g],h`? – David Conrad Aug 13 '14 at 18:24
  • What do you expect for repeated commas? `a,,b,[c,d],,e`? (Perhaps moot at this point.) – David Conrad Aug 13 '14 at 18:56
  • @DavidConrad There should not be. If you can, throw a custom exception, else, leave an empty string. – spongebob Aug 14 '14 at 06:48

2 Answers

5

Preferred approach at the end of the answer

It seems you are looking for the look-around mechanism.

For instance, if you want to split on whitespace that has no foo before it and no bar after it, your code can look like

split("(?<!foo)\\s(?!bar)")
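To see what this does in practice, here is a minimal sketch. The class name, method name and sample input are made up for illustration; only the regex comes from the line above.

```java
import java.util.Arrays;

class LookAroundSplitDemo {
    // Splits on a single whitespace character, unless "foo" comes right
    // before it or "bar" comes right after it.
    static String[] splitOnFreeWhitespace(String input) {
        return input.split("(?<!foo)\\s(?!bar)");
    }

    public static void main(String[] args) {
        // The space after "foo" and the space before "bar" are kept.
        System.out.println(Arrays.toString(splitOnFreeWhitespace("a b foo c d bar e")));
        // prints [a, b, foo c, d bar, e]
    }
}
```

Look-arounds are zero-width, so foo and bar are only checked, never consumed; they stay in the resulting tokens.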

Update (assuming that [...] blocks are never nested and are well formed, i.e. every [ is closed by a ]):

Your case seems a little more complex. What you can do is accept a comma , if

  • it doesn't have any [ or ] after it,
  • or the first opening bracket [ after this comma has no closing bracket ] between the comma and itself; otherwise the comma is inside a [...] area, like

    [ , ] [
      ^ ^ ^ - first `[` after tested comma
      | +---- one `]` between tested comma and first `[` after it
      +------ tested comma
    

So your code can look like this
(this is the original version; a slightly simplified one is given below)

split(",(?=[^\\]]*(\\[|$))")

This regex is based on the idea that the commas you don't want to accept are inside [foo,bar]. But how do we determine whether we are inside (or outside) such a block?

  1. If a character is inside a [...] area, no [ can appear after it until the closing ] is found. A new [ can appear only after that ]: in [a,b],[c,d] the comma between a and b has no [ before its ], but a new [...] area, which of course starts with [, can follow.
  2. If a character is outside any [...] area, only non-] characters can follow it until the start of the next [...] area or the end of the string.

The second case is the one you are interested in. So we need a regex that accepts a , followed only by non-] characters (so it is not inside [...]) until we find a [ or reach the end of the string (represented by $).

Such a regex can be written as

  • , comma
  • (?=...) which is followed by
  • [^\\]]*(\\[|$)
    • [^\\]]* zero or more non-] characters (] needs to be escaped because it is a metacharacter)
    • (\\[|$) followed by [ (which also needs to be escaped in regex) or the end of the string
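As a quick check of the breakdown above, here is a small sketch that runs this regex on the input from the question and on the [a,b],[c,d] case. The class and method names are hypothetical, and the same assumptions apply (no nesting, well-formed brackets).

```java
import java.util.Arrays;

class BracketAwareSplitDemo {
    // Splits on commas that are followed only by non-']' characters
    // up to the next '[' or the end of the string.
    static String[] splitOutsideBrackets(String input) {
        return input.split(",(?=[^\\]]*(\\[|$))");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOutsideBrackets("a,b,[c,d],e")));
        // prints [a, b, [c,d], e]
        System.out.println(Arrays.toString(splitOutsideBrackets("[a,b],[c,d]")));
        // prints [[a,b], [c,d]]
    }
}
```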

Slightly simplified split version

string.split(",(?![^\\[]*\\])");

This means: split on a comma , that is not followed (negative look-ahead, (?!...)) by an unclosed ]. An unclosed ] is one that has no [ between the tested comma and itself, which can be written as [^\\[]*\\].
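A short sketch of the simplified version follows, including the repeated-comma input raised in the comments on the question. The class and method names are made up for illustration. Note that String.split keeps empty strings in the middle of the result and drops only trailing ones.

```java
import java.util.Arrays;

class SimplifiedSplitDemo {
    // Splits on commas that are not followed by a ']' whose '[' lies
    // before the comma (no nesting, well-formed brackets assumed).
    static String[] splitOutsideBrackets(String input) {
        return input.split(",(?![^\\[]*\\])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOutsideBrackets("a,b,[c,d],e")));
        // prints [a, b, [c,d], e]
        System.out.println(Arrays.toString(splitOutsideBrackets("a,,b,[c,d],,e")));
        // prints [a, , b, [c,d], , e]  (repeated commas yield empty strings)
    }
}
```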


Preferred approach

To avoid such a complex regex, don't use split; use the Pattern and Matcher classes instead, and search for either [...] areas or runs of non-comma characters.

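import java.util.regex.Matcher;
import java.util.regex.Pattern;

String string = "a,b,[c,d],e";
// Either a whole [...] area (reluctant .*? stops at the first ]) or a run of non-comma characters
Pattern p = Pattern.compile("\\[.*?\\]|[^,]+");
Matcher m = p.matcher(string);
while (m.find())
    System.out.println(m.group());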

Output:

a
b
[c,d]
e
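If you need the tokens as a collection rather than printed output, the same pattern can be wrapped in a helper. This is only a sketch: the class name and the tokens method are hypothetical. One difference from split is worth knowing: since [^,]+ needs at least one character, empty fields between repeated commas are skipped instead of being returned as empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketTokenizerDemo {
    // Either a whole [...] area or a run of non-comma characters.
    private static final Pattern TOKEN = Pattern.compile("\\[.*?\\]|[^,]+");

    // Collects every match of TOKEN, in order of appearance.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a,b,[c,d],e")); // prints [a, b, [c,d], e]
        System.out.println(tokens("a,,b"));        // prints [a, b]
    }
}
```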
Pshemo
  • @Joiner Which parts should I explain? [Character class](http://www.regular-expressions.info/charclass.html) `[..]`? Negated character class `[^..]`? `OR` operator `|`? End of string [anchor](http://www.regular-expressions.info/anchors.html) `$`? – Pshemo Aug 14 '14 at 10:04
  • For example, I can't understand `(?<!foo)\\s(?!bar)`: `(?!bar)` does not have the `>` character, why? And I know `\s`, not `\\s`. See Mati Cicero's edit. – spongebob Aug 14 '14 at 11:57
  • @Joiner One is negative-look-behind `(?<!...)` and the other without `<` is negative-look-ahead `(?!...)` it is all explained in link about look-around mechanism I gave at start of my answer. – Pshemo Aug 14 '14 at 12:05
  • @Joiner "*And I know `\s`, not `\\s`*" to create ``\`` literal in Java String which you could pass to regex engine you need to escape it, so String representing ``\`` needs to be written as `"\\"`. That is why to create `\s` you need to write it as `"\\s"` (if that is what you asked). – Pshemo Aug 14 '14 at 12:09
2

A simple Regex will satisfy your needs:

(?<!\[\w),(?!\w\])

This regular expression means the following:

  • (?<!\[\w) = The match cannot come right after [x, where x is a single word character (\w: a letter, digit or underscore)
  • , = The match should be a comma
  • (?!\w\]) = The match cannot come right before x], where x is a single word character

You may use it as follows:

String[] split = text.split("(?<!\\[\\w),(?!\\w\\])");

Output:

a
b
[c,d]
e
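Note that this regex only looks one character to each side of the comma, so it depends on the bracketed elements being single word characters, as in the sample input. A small sketch (class and method names made up for illustration) shows both the working case and a case where it splits inside the brackets:

```java
import java.util.Arrays;

class SingleCharLookAroundDemo {
    // Rejects a comma only when it sits directly between "[x" and "x]",
    // where x is exactly one word character.
    static String[] split(String input) {
        return input.split("(?<!\\[\\w),(?!\\w\\])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("a,b,[c,d],e")));
        // prints [a, b, [c,d], e]
        System.out.println(Arrays.toString(split("a,[cc,dd],e")));
        // prints [a, [cc, dd], e]  (four tokens: the bracketed block is split)
    }
}
```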
Matias Cicero
  • `(?!\w\])`: I see `\w\]`. Where is `x]`? And why `(?<!\[\w)` is not `(?<![\w\)`? – spongebob Aug 14 '14 at 12:03
  • The `x` was just an example showing that it could be _any_ character. In Regular Expressions, _any character_ is expressed as **\w**. Also, in Regular Expressions, the **[** is a reserved keyword or character, and is used for ranges, so, in order to tell the compiler we want the **explicit [**, we should escape it by using the back slash, therefore we shall write it as `\[` – Matias Cicero Aug 14 '14 at 12:07
  • `"a":["b","c"]` split with `(?<!\\[\\w),(?!\\w\\])` returns `"a":["b"` **and** `"c"]`. – spongebob Aug 14 '14 at 12:19
  • I formulated my Regular Expression to be compatible with your provided code `a,b,[c,d],e` – Matias Cicero Aug 14 '14 at 12:25