2

I have a question similar to "How to split a string, but also keep the delimiters?". How would I split a String using a regex, keeping some types of delimiters but not others? Specifically, I want to keep the non-whitespace delimiters, but not the whitespace delimiters.

To make this concrete:

"a;b c"        | ["a", ";", "b", "c"]
"a; ; bb c ;d" | ["a", ";", ";", "bb", "c", ";", "d"]

Can this be done cleanly with a regex, and if so how?

Right now I'm working around this by splitting on the character to keep, and then again on the other one. I can stick with this approach if the regex cannot do so, or cannot do so cleanly:

Arrays.stream(input.split("((?<=;)|(?=;))"))
        .flatMap(s -> Arrays.stream(s.split("\\s+")))
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new); // In practice, I would generally use .collect(Collectors.toList()) instead
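For reference, the workaround above can be wrapped into a self-contained program. This is only a sketch: the class name `TwoPassSplit` and the helper `tokenize` are made up for illustration, and it collects to a `List` as the inline comment suggests.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TwoPassSplit {
    // Hypothetical helper: the two-pass workaround from the question.
    static List<String> tokenize(String input) {
        return Arrays.stream(input.split("((?<=;)|(?=;))"))    // pass 1: isolate each ';'
                .flatMap(s -> Arrays.stream(s.split("\\s+")))  // pass 2: break on whitespace
                .filter(s -> !s.isEmpty())                     // drop empties left by pass 2
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a;b c"));        // [a, ;, b, c]
        System.out.println(tokenize("a; ; bb c ;d")); // [a, ;, ;, bb, c, ;, d]
    }
}
```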
M. Justin

6 Answers

3

I suggest capturing what you want instead of splitting, using this simple pattern:

([^; ]+|;)


alpha bravo
  • While this doesn't answer the question asked about splitting on a regex, this may be the best answer to the underlying question of constructing the desired list of elements. It's simple, concise, easy to understand, and self explanatory. The other solutions require a fairly deep understanding of regexes, and a careful evaluation of the regex being used. However, I'm not sure I should mark it as the accepted answer, as the actual question of splitting the list also has merit on its own. – M. Justin Aug 21 '16 at 03:31
  • 1
    This would be the actual Java code for this solution after updating it to include all whitespace that the \s character class includes, and not just spaces: `Matcher matcher = Pattern.compile("([^; \t\n\u000B\f\r]+|;)").matcher(input); List<String> matches = new ArrayList<>(); while (matcher.find()) { matches.add(matcher.group()); } return matches;`. Note that the actual Java code for this is longer than using split since the API doesn't provide a one-line mechanism for getting all groups. – M. Justin Aug 21 '16 at 03:32
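The capture-instead-of-split idea might look like the following as a complete program. It is a sketch under a few assumptions: the class name `TokenMatcher` and method `tokenize` are invented here, and the pattern uses `\s` inside the negated character class rather than listing each whitespace character, which should be equivalent for the default `\s`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatcher {
    // A token is either a run of characters that are neither ';' nor whitespace,
    // or a single ';'. Whitespace is never matched, so it is simply skipped.
    private static final Pattern TOKEN = Pattern.compile("[^;\\s]+|;");

    // Hypothetical helper: collects every match of TOKEN in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a;b c"));        // [a, ;, b, c]
        System.out.println(tokenize("a; ; bb c ;d")); // [a, ;, ;, bb, c, ;, d]
        System.out.println(tokenize(" a "));          // [a]
    }
}
```

Because nothing is split, leading whitespace never produces an empty first element with this approach.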
2

You can do it this way:

System.out.println(String.join("-", "a; ; b c ;d".split("(?!\\G) *(?=;)|(?<=;) *| +")));

details:

(?!\\G)  # not contiguous to a previous match and not at the start of the string
[ ]*     # optional spaces
(?=;)    # followed by a ;
|        # OR
(?<=;)   # preceded by a ;
[ ]*     # optional spaces
|        # OR
[ ]+     # one or more spaces

Feel free to change the literal space to \\s. To avoid an empty item at the beginning of the resulting array when the string starts with whitespace, you need to trim the string first.

Obviously, without the constraint of splitting, @alphabravo's way is the simplest.
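Putting the two suggestions above together (swap the literal space for `\\s`, and trim first) could look like this. It is a sketch only: the class name `LookaroundSplit` and the helper `tokenize` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class LookaroundSplit {
    // Same three alternatives as above, with the literal space replaced by \s.
    private static final String SPLIT_REGEX = "(?!\\G)\\s*(?=;)|(?<=;)\\s*|\\s+";

    // Hypothetical helper: trim first so leading whitespace cannot yield an empty first item.
    static List<String> tokenize(String input) {
        return Arrays.asList(input.trim().split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a;b c"));        // [a, ;, b, c]
        System.out.println(tokenize("a; ; bb c ;d")); // [a, ;, ;, bb, c, ;, d]
        System.out.println(tokenize(" a "));          // [a]
    }
}
```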

Casimir et Hippolyte
  • This is great, but won't `"(?!\\G)\\s*"` alone do the trick? It certainly works for the examples given. – Alan Moore Aug 20 '16 at 23:48
  • @AlanMoore I updated the examples to show that I expect multiple contiguous non-whitespace, non-semicolon characters to be included in the same match result. This simplification will not work for the updated example. – M. Justin Aug 21 '16 at 03:22
2

I found a regex that works:

(\\s+)|((?<=;)(?=\\S)|(?<=\\S)(?=;))
public static void main(String[] args) {
    System.out.println(Arrays.toString("a; ; b c ;d"
        .split("(\\s+)|((?<=;)(?=\\S)|(?<=\\S)(?=;))")));
}

Will print out:

[a, ;, ;, b, c, ;, d]
Arthur
1

You want to split on whitespace, or between a word character and a non-word character:

str.split("\\s+|(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
Bohemian
  • This doesn't quite answer the question as posed, since I don't actually care if it's a word character (\w being the same as [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}]), so much as whether it's a semicolon or not. Arthur's solution above (http://stackoverflow.com/a/39059565/1108305) is effectively the same as this one, but only checks semicolons and whitespace. – M. Justin Aug 21 '16 at 03:37
1

After realizing that Java's split doesn't support adding captured delimiters to the
resulting array, I thought I'd try a split solution that works without that
capability.

Basically, there are only 4 permutations involving whitespace and the semicolon.
Finally, there is the case of whitespace on its own.

Here is the regex.

Raw: \s+(?=;)|(?<=;)\s+|(?<!\s)(?=;)|(?<=;)(?!\s)|\s+

Stringed: "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|\\s+"

And the expanded regex with the permutations explained.
Good luck!

    \s+                  # Required, suck up wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    \s+                  # Required, suck up wsp after ;

 |                     # or,

    (?<! \s )            # No wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    (?! \s )             # No wsp after ;

 |                     # or,

    \s+                  # Required wsp
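
As a quick check of the stringed form above, here is a minimal sketch that applies it with `String.split` to the question's two examples. The class name `FiveWaySplit` and the helper `tokenize` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class FiveWaySplit {
    // The "Stringed" regex from above: four whitespace/semicolon permutations, then bare whitespace.
    private static final String SPLIT_REGEX =
            "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|\\s+";

    // Hypothetical helper: splits without any pre-processing of the input.
    static List<String> tokenize(String input) {
        return Arrays.asList(input.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a;b c"));        // [a, ;, b, c]
        System.out.println(tokenize("a; ; bb c ;d")); // [a, ;, ;, bb, c, ;, d]
    }
}
```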

Edit

To avoid splitting on whitespace at the beginning of the string (BOS), use this regex.

Raw: \s+(?=;)|(?<=;)\s+|(?<!\s)(?=;)|(?<=;)(?!\s)|(?<!^)(?<!\s)\s+

Stringed: "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|(?<!^)(?<!\\s)\\s+"

Explained:

    \s+                  # Required, suck up wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    \s+                  # Required, suck up wsp after ;

 |                     # or,

    (?<! \s )            # No wsp before ;
    (?= ; )              # ;

 |                     # or,

    (?<= ; )             # ;
    (?! \s )             # No wsp after ;

 |                     # or,

    (?<! ^ )             # No split of wsp at BOS   
    (?<! \s )
    \s+                  # Required wsp
  • This almost works, but it includes initial spaces as an additional split (" a " -> ["", "a"] instead of ["a"]). – M. Justin Aug 21 '16 at 07:15
  • If you want to allow those additional spaces, it just needs an assertion. –  Aug 21 '16 at 07:18
  • I'm not quite sure what you're saying. When I apply your regex using String.split() in Java to " a ", it gives two elements in the split list (the empty string and "a"). I would want and expect it to return just one ("a"). – M. Justin Aug 21 '16 at 07:20
  • I'll put up an edit for you, just a second. Split won't let you trim inline, but you can match the " a" as one element, then do a trim left on element 0. Is that something you can do? –  Aug 21 '16 at 07:20
  • Actually, I think I was misunderstanding how Java splits Strings; trailing empty Strings are excluded, but not leading ones. – M. Justin Aug 21 '16 at 07:23
  • Ok, I've added a regex to disallow wsp split at BOS. But, this will make `" a;b" -> [" a", "b"]`, but you should be able to do a blind trim of element 0. The best that can be done with regex and split. –  Aug 21 '16 at 07:31
  • Now that I realize that it's a quirk in the Java split API, I don't think there's value in working around the issue in the regex itself. I think either trimming the string first, or using a third-party splitting API (such as Guava's [Splitter](https://github.com/google/guava/wiki/StringsExplained#Splitter)), would be the better approach. – M. Justin Aug 21 '16 at 17:55
  • I wouldn't use any 3rd party for this. The best approach is to first run global replace `^\s+|\s+$` with blank. Then use my first regex with split. –  Aug 22 '16 at 12:18
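
The behaviour discussed in these comments can be reproduced with a short sketch: `String.split` drops trailing empty strings but keeps a leading one when a non-empty match sits at index 0, and stripping the ends first sidesteps it. The class name `LeadingEmptyDemo` and the helper `splitStripped` are hypothetical; the regex is the first one from this answer.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    static final String SPLIT_REGEX =
            "\\s+(?=;)|(?<=;)\\s+|(?<!\\s)(?=;)|(?<=;)(?!\\s)|\\s+";

    // Hypothetical helper: remove leading and trailing whitespace, then split.
    static String[] splitStripped(String input) {
        String stripped = input.replaceAll("^\\s+|\\s+$", "");
        return stripped.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // Leading whitespace yields an empty first element; the trailing empty one is dropped.
        System.out.println(Arrays.toString(" a ".split(SPLIT_REGEX)));        // [, a]
        // Stripping first avoids the empty element.
        System.out.println(Arrays.toString(splitStripped(" a ")));            // [a]
        System.out.println(Arrays.toString(splitStripped(" a; ; bb c ;d "))); // [a, ;, ;, bb, c, ;, d]
    }
}
```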
0

Borrowing @CasimiretHippolyte's \G trick, you may want to split on:

\\s+|(?!\\G)()

Note: no delimiters are specified.

Update

To avoid splitting on the leading spaces:

(?m)(?<!^|\\s)(\\s+|)(?!$)
revo
  • Why the empty capturing group "()" at the end? It seems to not be doing anything, and it appears to work just as correctly without it. – M. Justin Aug 21 '16 at 03:52
  • Yes, the capturing group is for verbosity only. I also updated my answer to fit your new requirement. Please check. @M.Justin – revo Aug 21 '16 at 07:56