
Does anyone know how to split a string on a character taking into account its escape sequence?

For example, if the character is ':', "a:b" is split into two parts ("a" and "b"), whereas "a\:b" is not split at all.

I think this is hard (impossible?) to do with regular expressions.

Thank you in advance,

Kedar

Michael Myers
Kedar Mhaswade
  • See also http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas. – Michael Myers May 07 '09 at 21:23

2 Answers


(?<=^|[^\\]): gets you close, but doesn't handle escaped backslashes. (That's the literal regex; of course you have to escape the backslashes in it again to put it in a Java string.)
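To make the escaping concrete, here is a small sketch of that first pattern written as a Java string literal, along with the case it gets wrong. The class name NaiveEscapeSplit and the field NAIVE are made up for illustration; only Pattern.compile and Pattern.split are real API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NaiveEscapeSplit {
    // Literal regex (?<=^|[^\\]): with every backslash doubled for the Java string.
    // It splits on ':' only when the ':' is at the start or follows a non-backslash.
    static final Pattern NAIVE = Pattern.compile("(?<=^|[^\\\\]):");

    public static void main(String[] args) {
        // Input a:b\:c  (the second colon is escaped, so only the first one splits)
        System.out.println(Arrays.toString(NAIVE.split("a:b\\:c")));  // [a, b\:c]

        // Input a\\:b  (escaped backslash followed by a real delimiter)
        // The colon follows a backslash, so this pattern wrongly leaves it unsplit.
        System.out.println(Arrays.toString(NAIVE.split("a\\\\:b")));  // [a\\:b]
    }
}
```

The second call is the "escaped backslash" case mentioned above: the delimiter is real, but the pattern cannot tell.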

(?<=(^|[^\\])(\\\\)*): How about that? I think that should match any ':' that is preceded by an even number of backslashes.

Edit: don't vote this up. MizardX's solution is better :)

Jeremy Huiskamp
  • The key is the (?<=foo) construct, positive look-behind. You need to check what precedes the ':' without matching it. – Jeremy Huiskamp May 07 '09 at 19:22
  • MizardX points out that look-behind needs to have a finite length. Mine doesn't, so I guess it wouldn't work (I have not tested it). I believe our solutions are otherwise similar. His is probably better in that it uses negative look-behind to check for a non-backslash character, whereas I use "^|[^\\]", which may or may not act differently in multi-line scenarios (not sure). – Jeremy Huiskamp May 07 '09 at 19:43
  • (^|[^\\]) should work. ^ could possibly match the start of a line instead of the start of the string. That's fine, since it still ensures that there is no backslash at that position. [^\\] also matches newlines, so there is no problem when multi-line mode is not used either. – Markus Jarderot May 07 '09 at 20:08

Since Java supports variable-length look-behinds (as long as they are finite), you could do it like this:

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] argv) {

        Pattern p = Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";

        String[] parts = p.split(text);

        System.out.printf("Input string: %s\n", text);
        for (int i = 0; i < parts.length; i++) {
            System.out.printf("Part %d: %s\n", i+1, parts[i]);
        }

    }
}
  • (?<=(?<!\\)(?:\\\\){0,10}) looks behind for an even number of backslashes (from zero up to a maximum of 10 pairs, i.e. 20 backslashes).

Output:

Input string: foo:bar\:baz\\:qux\\\:quux\\\\:corge
Part 1: foo
Part 2: bar\:baz\\
Part 3: qux\\\:quux\\\\
Part 4: corge
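One thing the output above doesn't exercise: Pattern.split(CharSequence) drops trailing empty strings, so an input that ends in an unescaped ':' loses its last (empty) piece. A minimal sketch, assuming you want to keep that piece, is to pass a negative limit to the two-argument split. The class name SplitKeepEmpty and the helper splitUnescaped are hypothetical names, not part of the answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitKeepEmpty {
    // Same pattern as above: ':' preceded by an even number of backslashes (0 to 10 pairs).
    static final Pattern DELIM = Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

    // Hypothetical helper: a negative limit tells split() to keep trailing empty strings.
    static String[] splitUnescaped(String text) {
        return DELIM.split(text, -1);
    }

    public static void main(String[] args) {
        // Default split: the empty piece after the final ':' is discarded.
        System.out.println(Arrays.toString(DELIM.split("a:b:")));     // [a, b]
        // Negative limit: the trailing empty piece is kept.
        System.out.println(Arrays.toString(splitUnescaped("a:b:")));  // [a, b, ]
        // Input a\:b  (escaped delimiter, so there is nothing to split on)
        System.out.println(Arrays.toString(splitUnescaped("a\\:b"))); // [a\:b]
    }
}
```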

Another way would be to match the parts themselves, instead of splitting at the delimiters.

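// Reuses `text` from the example above; also needs java.util.List and java.util.LinkedList.
Pattern p2 = Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");
List<String> parts2 = new LinkedList<String>();
Matcher m = p2.matcher(text);
while (m.find()) {
    parts2.add(m.group(1)); // group 1 is the piece, escape sequences left intact
}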

The strange syntax is there because the pattern needs to handle empty pieces at the start and end of the string. When a match spans exactly zero characters, the next attempt will start one character past the end of it. If it didn't, it would match another empty string, and another, ad infinitum…

  • (?<=\A|\G:) will look behind for either the start of the string (the first piece), or the end of the previous match, followed by the separator. If we did (?:\A|\G:), it would fail if the first piece is empty (input starts with a separator).
  • \\. matches any escaped character.
  • [^:\\] matches any character that is neither the delimiter nor a backslash, i.e. any character outside an escape sequence (\\. has already consumed both characters of each escape sequence).
  • ((?:\\.|[^:\\])*) captures all characters up until the first non-escaped delimiter into capture-group 1.
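The alternation described in the bullets above (consume an escaped pair, otherwise consume one non-delimiter character) can also be written as a plain loop, which avoids the look-behind length limit altogether. This is only a sketch and not part of the answer's code: the class name EscapeAwareSplitter and its split method are made up, backslash is assumed to be the escape character, and escape sequences are left intact in the pieces, as in the regex versions.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    // Splits text on unescaped occurrences of delimiter; a backslash escapes the next character.
    static List<String> split(String text, char delimiter) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                // Escaped pair: copy the backslash and the following character verbatim.
                current.append(c).append(text.charAt(++i));
            } else if (c == delimiter) {
                // Unescaped delimiter: close the current piece (it may be empty).
                parts.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character (a lone trailing backslash also lands here and is kept).
                current.append(c);
            }
        }
        parts.add(current.toString()); // last piece, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";
        // Prints the same four pieces as the regex version:
        // foo, bar\:baz\\, qux\\\:quux\\\\, corge
        for (String part : split(text, ':')) {
            System.out.println(part);
        }
    }
}
```

Unlike the look-behind pattern, the loop has no cap on the number of consecutive backslashes, and it always keeps empty pieces at either end.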
Markus Jarderot